Matti Viherkoski

Investigating Financial Drivers of ESG Scores: An Interpretable Machine Learning Approach

Vaasa 2025
School of Finance and Accounting
Master's Degree in Finance

UNIVERSITY OF VAASA
School of Finance and Accounting
Author: Matti Viherkoski
Title of the thesis: Investigating Financial Drivers of ESG Scores: An Interpretable Machine Learning Approach
Degree: Master of Science in Economics and Business Administration
Degree Programme: Master's Programme in Finance
Supervisor: Timo Rothovius
Year: 2025
Pages: 115

ABSTRACT:
Growing interest in sustainable finance has increased the demand for transparent and reproducible assessments of corporate sustainability. The study is grounded in the idea that ESG ratings reflect both sustainability-related practices and, potentially, the economic capacity to disclose and implement them. This thesis examines the extent to which financial information explains variation in ESG ratings. The objective is to assess how far ESG outcomes are predictable from firm-level characteristics, to identify the financial factors most consistently associated with them, and to analyse the underlying structure of these relationships.

The empirical analysis was based on data on firms included in the STOXX Europe 600 index for the period 2014–2023, obtained from the London Stock Exchange Group (LSEG) database. The dataset contained firm-level annual observations, and the dependent variables consisted of the overall ESG score and its environmental, social, and governance pillars. The independent variables comprised profitability, leverage, liquidity, efficiency, and valuation ratios, together with firm size, industry, and year identifiers.

The methods combined supervised machine learning with model-agnostic interpretability. Model evaluation relied on standard regression metrics and explainable artificial intelligence methods.
The results indicated that firm-level financial characteristics explain a substantial portion of the cross-sectional variation in ESG assessments. Nonlinear models outperformed linear alternatives, demonstrating that the relationships between financial and sustainability indicators are complex and potentially interactive. The analysis highlighted firm size, operational efficiency, and capital structure as key predictors of higher ESG scores, whereas high profitability margins and liquidity were not systematically associated with higher assessed sustainability.

The findings suggest that financial structure influences measured sustainability, implying that ESG scores can partly reflect underlying economic fundamentals in addition to non-financial performance. The study showed that interpretable machine learning offers a practical framework for understanding these linkages, but also that financial data alone cannot fully account for the multidimensional nature of sustainability. Future research is encouraged to integrate non-financial and textual data, apply longitudinal designs, and examine regulatory developments to better capture the dynamic relationship between corporate finance and sustainability outcomes.

KEYWORDS: machine learning, modelling, responsible investing, sustainability reporting, corporate responsibility

UNIVERSITY OF VAASA (Vaasan yliopisto)
School of Finance and Accounting
Author: Matti Viherkoski
Title of the thesis: Investigating Financial Drivers of ESG Scores: An Interpretable Machine Learning Approach
Degree: Master of Science in Economics and Business Administration
Major: Finance
Supervisor: Timo Rothovius
Year: 2025
Pages: 115

TIIVISTELMÄ (Finnish abstract, in English translation):
The growing importance of sustainable finance has increased the need for transparent and reproducible ways of assessing corporate responsibility.
The study is based on the assumption that ESG ratings reflect both sustainability practices and, potentially, the financial capacity required to report on and implement them. The objective of the study was to assess how far ESG outcomes can be predicted from firm-level financial indicators, to identify which financial metrics are most strongly linked to them, and to analyse the structural nature of these relationships.

The empirical analysis was based on a dataset of firms in the STOXX Europe 600 index for the years 2014–2023, obtained from the London Stock Exchange Group database. The data consisted of annual firm-level observations; the dependent variables were the firms' sustainability scores, and the explanatory variables were ratios describing profitability, leverage, liquidity, efficiency, and valuation, together with firm size, industry classification, and a year identifier.

The methods combined supervised machine learning with model-agnostic interpretability. Model evaluation was based on standard regression metrics and explainable artificial intelligence methods.

The results showed that firm-level financial characteristics explain a significant share of the cross-sectional variation in ESG assessments. Nonlinear models were found to outperform linear alternatives, indicating that the relationships between financial and sustainability indicators are complex and potentially interactive. The analysis highlighted firm size, operational efficiency, and capital structure as key predictors of ESG scores, whereas high profitability margins and liquidity were not systematically associated with a higher assessed level of sustainability.

The results indicated that financial structure affects measured sustainability, suggesting that ESG scores partly reflect underlying economic fundamentals in addition to non-financial information.
The study showed that interpretable machine learning offers a practical framework for understanding these linkages, but financial information alone is not sufficient to explain the nature of sustainability. For future research, the integration of non-financial and textual data, the use of longitudinal designs, and the examination of regulatory developments were recommended, so that the dynamic relationship between corporate finance and sustainability outcomes can be described more precisely.

KEYWORDS: machine learning, modelling, responsible investing, sustainability reporting, corporate responsibility

Contents

1 Introduction 9
1.1 Purpose and motivation 10
1.2 Research hypotheses 11
1.3 Structure of the study 11
2 Literature review 13
2.1 Corporate Sustainability and ESG Scores 13
2.1.1 Defining Corporate Sustainability 13
2.1.2 Importance of Sustainability for Companies 13
2.1.3 Measuring Sustainability: ESG Scores and Other Metrics 14
2.2 Predictive Modelling of ESG Scores 16
2.2.1 Linear and Regularized Regression Models 17
2.2.2 Tree-Based Ensemble Models (Random Forests and Boosting) 19
2.2.3 Deep Learning Models (Neural Networks) 22
3 Data and Data Processing 26
3.1 Data Integrity and Initial Screening 29
3.2 Currency Standardization 30
3.3 Descriptive Statistics of the Processed Dataset 30
3.3.1 Overview of ESG Scores 31
3.3.2 Overview of Financial Variables 31
3.3.3 Interpretation and Relevance for Modelling 32
3.4 Distribution Across Time 32
3.5 Industry Classification 34
3.6 Correlogram 37
3.7 Density Functions of ESG Scores 39
3.8 Missing Data Assessment and Handling 40
3.8.1 Extent and Distribution of Missingness 41
3.8.2 Mechanisms of Missing Data 42
3.8.3 Implications for Data Integrity and Modelling 42
3.8.4 Treatment Principles 43
3.9 Pre-Modelling Procedures and Pipeline Design 43
3.10 Winsorisation 45
4 Methodology 47
4.1 Machine Learning Models – general information on models used 47
4.1.1 Random Forest 47
4.1.2 XGBoost 48
4.1.3 RIDGE and LASSO Regression 48
4.2 Trained Models 49
4.2.1 XGB1 49
4.2.2 XGB2 51
4.2.3 XGB3 52
4.2.4 RF1 52
4.2.5 RF2 52
4.2.6 RF3 54
4.2.7 RIDGE 55
4.2.8 LASSO 55
5 Results 56
5.1 Observed and Residuals vs. Predicted ESG Score 58
5.1.1 Observed vs. predicted plot 58
5.1.2 Residuals vs. predicted plot 59
5.1.3 Observed vs. Predicted ESG Score 59
5.1.4 Residuals vs. Predicted ESG Score 61
5.2 Feature Importances 62
5.2.1 Permutation Importances 62
5.2.2 Feature Importances by Weight, Cover and Gain 64
5.2.2.1 Feature Importances by Weight 65
5.2.2.2 Feature Importances by Cover 67
5.2.2.3 Feature Importances by Gain 69
5.2.2.4 Cross-metric feature comparison 71
5.3 SHAP Metrics 73
5.3.1 Directional SHAP Feature Importance Metrics 73
5.3.2 SHAP Summary Dot Plot 76
5.4 Cross-validated Partial Dependence Plots 79
5.5 SHAP Dependence plots and 2D partial dependence 83
5.6 Further Analysis of Negative Directions 88
6 Discussion 91
6.1 Findings and Previous Literature 91
6.2 Limitations 95
6.2.1 Comparisons to Previous Literature 95
6.2.2 Data and construct validity 96
6.2.3 Study Design 97
6.2.4 Feature Score and Omitted Variables 97
6.2.5 Technical Constraints 97
6.2.6 Interpretability Caveats 98
6.2.7 External Validity and Regulatory Shifts 98
6.3 Suggestions for Future Research 98
Conclusion 101
References 104
Appendices 108
Appendix 1. ESG Report/Rating Summary Table by Huber et al. (2017) 108
Appendix 2. Results from other trained models 110

Figures

Figure 1. Mean ESG Score by Year and Industry. 36
Figure 2. Standard deviation of ESG Score by Year and Industry. 36
Figure 3. Correlogram. 38
Figure 4. Density functions of ESG score, E, S and G. 39
Figure 5. XGB1 Observed vs predicted ESG score and residuals. 59
Figure 6. Permutation Importances (XGB1). 63
Figure 7. Feature Importance by Weight (XGB1). 66
Figure 8. Feature Importance by Cover (XGB1). 68
Figure 9. Feature Importance by Gain (XGB1). 70
Figure 10. Directional SHAP feature importance Metric Bar (XGB1). 74
Figure 11. SHAP Feature Importance Summary Dot (XGB1). 77
Figure 12. XGB1 Cross-Validated PDPs for SIZE and TD/TA against ESG Score. 79
Figure 13. XGB1 Cross-Validated PDPs for NS/TA and EBIT/NS against ESG Score. 80
Figure 14. XGB1 Cross-Validated PDPs for DIV Y and ROE against ESG Score. 81
Figure 15. XGB1 Cross-Validated PDPs for ROA against ESG Score. 82
Figure 16. SHAP Plot and 2D Dependence for NS/TA and SIZE. 84
Figure 17. SHAP Plot and 2D Dependence for DIV Y and SIZE. 85
Figure 18. SHAP Plot and 2D Dependence for TD/TA and SIZE. 86
Figure 19. SHAP Plot and 2D Dependence for EBIT/NS and SIZE. 86
Figure 20. SHAP Plot and 2D Dependence for NS/TA and TD/TA. 87
Figure 21. RF2 Observed vs predicted and residuals. 110
Figure 22. RIDGE Observed vs predicted and residuals. 110
Figure 23. LASSO Observed vs predicted and residuals. 111
Figure 24. RF2 Intrinsic Feature Importances. 111
Figure 25. RF2 Permutation Importances. 112
Figure 26. RF2 SHAP Feature Importances bar. 112
Figure 27. RF2 SHAP Feature Importances plot. 113
Figure 28. RF2 SHAP Directional metrics. 113
Figure 29. RF2 Cross validated PDPs for ASSETS and TD/TA. 114
Figure 30. RF2 Cross validated PDPs for NS/TA and EBIT/NS. 114
Figure 31. RF2 Cross validated PDPs for DIV Y and ROE. 114
Figure 32. RF2 Cross validated PDPs for P/E and ROA. 115

Tables

Table 1. Summary of key studies. 25
Table 2. Summary Statistics table. 30
Table 3. Main statistics of the ESG, E, S and G score distributions by year for the sample of 600 companies listed in the STOXX Europe 600 Index. 33
Table 4. General Industry Classification explanation. 34
Table 5. Main statistics of the ESG, E, S, and G score distributions by industry sector for the sample of 600 companies listed in the STOXX Europe 600 Index. 35
Table 6. Density functions of ESG score, E, S and G. 40
Table 7. Missing values per industry table. 41
Table 8. Descriptions of XGBoost hyperparameters (XGBoost Developers, 2024). 51
Table 9. Descriptions of RandomForest parameters (scikit-learn developers, 2024). 53
Table 10. Model performance comparison. 56
Table 11. ESG Report/Rating Summary Table by Huber et al. (2017). 108

1 Introduction

The current investment landscape has been undergoing changes propelled by the growing demand for and representation of Socially Responsible Investing (SRI) (D'Amato et al., 2022). Alongside financial characteristics, SRI considers companies' ethical, social, and environmental values, aiming to generate financial returns while pursuing positive sustainability outcomes for stakeholders. Arguably the most important characteristics and metrics for socially responsible investors are Environmental, Social and Corporate Governance (ESG) characteristics. Large financial data and rating institutions play a pivotal role in providing market participants with benchmarks to guide their investment decision-making processes (D'Amato et al., 2022). These institutions provide ESG ratings for companies, giving socially responsible investors a quantifiable tool to compare and analyze the sustainability of companies. However, the accuracy and reliability of these ratings remain subjects of scrutiny. While ESG ratings are gaining traction, the accuracy of existing scores continues to be widely questioned, creating a need for further research and refinement of methodologies (Chowdhury et al., 2023).

An obstacle for investors and policymakers is the inability to accurately evaluate the reliability of the aggregation process used to determine ESG scores. This challenge stems from the lack of transparency in the rating system. Rating agencies generate ESG scores using proprietary models, and the information available to the public is often limited to what the agency chooses to disclose. In many cases, this disclosure is restricted to the fundamental principles of the methodology, which varies between agencies.
Consequently, from the perspective of outside stakeholders, the algorithms used by rating agencies can be considered black-box models whose inner workings are obscure (Del Vitto et al., 2023).

Furthermore, although several papers have found nonlinear relationships between sustainability metrics and their constituent indicators, Berg et al. (2022) discovered that six major ESG ratings are constructed using linear models. These ratings rely on ad hoc weighted averages, meaning that the model weights assigned by the rater are assumed to accurately reflect the relative importance of different ESG aspects. However, this approach overlooks the nuanced nature of ESG factors and may not fully capture their actual significance in sustainability assessment. Referring to the findings of Berg et al. (2022), Svanberg et al. (2022) argue that because complex concepts such as ESG are unlikely to have purely linear relationships with the features constituting ESG indicators, ESG ratings are unlikely to represent the actual degree of corporate sustainability.

Further relating to the issue of whether ESG ratings accurately represent the degree of sustainability, Billio et al. (2021) find that raters' disagreement on the characteristics, and on the weights, that define the components of ESG leads to varying sustainability assessments among rating agencies, and thus disperses the effect of sustainable investors' preferences on asset prices.

1.1 Purpose and motivation

Against this backdrop, the purpose of this thesis is to contribute to the evolving literature on sustainable finance and ESG investing by applying machine learning techniques to predict ESG scores from financial statement items and by analyzing the information learned by the models, in order to investigate the potential relationships between them.
The results of investigating these relationships can improve our understanding of how and which financial characteristics affect the ESG scores of companies, and of the inner workings of the "black-box" models or methods used by rating agencies to score companies' sustainability. The paper investigates whether, and to what extent, a company's ESG performance can be predicted using traditional financial statement items, and examines which financial features are most important in explaining ESG scores using Explainable AI tools, which are methods for interpreting machine learning models.

In terms of data and methodology, this thesis is most similar within the previous literature to the work of Chowdhury et al. (2023) and D'Amato et al. (2021, 2022), but it uses a combination of variables that those studies previously identified as significant, and a larger dataset with more recent observations than, for example, D'Amato et al. (2021), covering the years of the Covid-19 pandemic. In recent years the importance of sustainability has also kept rising, and LSEG's ESG score methodology may have quietly changed, as the more specific methodology is not disclosed.

1.2 Research hypotheses

This thesis hypothesises that the Random Forest and XGBoost machine learning methods can achieve a notable reduction in prediction variability, relative to a standard mean prediction, using only financial data at the firm-year level of observation, and that the information learned by the models can provide further insight into how ESG scores are affected by different financial characteristics. If this hypothesis holds, it raises questions about the nature of sustainability scores: To what degree do these scores accurately capture the sustainability of companies, considering that ESG ratings struggle with the problem of model transparency, and that the real sustainability effects of companies should, at least in theory, be rather independent of purely financial metrics?
To what extent is it appropriate to compare the ESG scores of different companies as a proxy for real sustainability on a continuous 0–100 scale, rather than on some form of weighted or adjusted scale, if companies with certain financially homogeneous metrics are consistently placed in different ESG score quantiles than others?

1.3 Structure of the study

In summary, in this thesis I aim to use supervised machine learning algorithms to predict ESG scores based on financial statement ratios and to analyse the patterns learned by the models. By doing so, the study seeks to examine whether publicly available financial data can approximate ESG ratings and to what extent ESG assessments may be driven by quantifiable financial indicators.

The thesis is organized as follows: Chapter 2 provides a review of the existing literature on sustainability, its impact on companies and how it can be measured, ESG ratings, and predictive modelling approaches. Chapter 3 outlines the data used in the research, the data processing and pre-processing steps, and the imputation methods. Chapter 4 outlines the methodology, covers general information on the machine learning techniques used, and provides further model-specific information on the model data and calibration. Chapter 5 presents the results of the models, including their performance, together with further explainable AI analysis of the patterns the best-performing models learned from the training dataset. Chapter 6 discusses the findings in light of previous studies, evaluates the implications of the model results and their interpretability, discusses the limitations of the study, and suggests directions for future research. Finally, the last chapter concludes the study.

2 Literature review

This chapter introduces key concepts and findings on the topic from the literature. Corporate sustainability is first defined and its importance briefly introduced, together with common metrics for measuring sustainability. The chapter continues by reviewing previous literature on the predictive modelling of ESG scores and concludes with a summary table of the key studies referred to in this chapter.

2.1 Corporate Sustainability and ESG Scores

The next subchapters define corporate sustainability and its importance for firms, and outline the ways companies present it.

2.1.1 Defining Corporate Sustainability

Corporate sustainability refers to a company's ability to conduct business in a way that is environmentally sound, socially responsible, maintains transparent governance principles, and sustains long-term economic viability (Ahmad et al., 2024). In practice, this means integrating ecological integrity, social welfare, and good governance into corporate strategies while continuing to create value for shareholders. This concept aligns with the "triple bottom line" of people, planet, and profit, emphasizing that sustainable firms balance financial performance with social and environmental stewardship. According to the OECD, corporate sustainability entails embedding environmental and social considerations into core business operations and strategy.

2.1.2 Importance of Sustainability for Companies

Companies are increasingly recognizing that strong sustainability practices can confer significant benefits. One key driver is investor demand: a growing share of global investments now incorporates Environmental, Social, and Governance (ESG) factors. As of the late 2010s, roughly $30 trillion in assets were managed using ESG criteria (D'Amato et al., 2021). Investors increasingly view sustainability as linked to long-term financial performance and effective risk management, prompting firms to improve ESG performance to attract capital.
Research further suggests that companies with strong sustainability profiles may be more resilient during periods of crisis. For example, firms with high ESG ratings experienced better stock return performance during the 2008 financial crisis (Lins et al., 2017; D'Amato et al., 2021).

Beyond investor considerations, Pilz (2024) suggests that sustainability also offers reputational and operational advantages. Embracing ESG can enhance a company's brand and consumer trust, while also helping identify risks and opportunities within operations (Pilz, 2024).

Pilz (2024) further suggests that sustainability initiatives can lead to cost savings, such as improved energy efficiency, and can spur innovation. Integrating ESG considerations also enables more informed decision-making and strengthens relationships with stakeholders. Surveys of executives consistently show that sustainability is no longer seen as a niche issue, but as essential for long-term success and risk mitigation (Pilz, 2024). In summary, companies should care about sustainability not only for ethical reasons but also because it aligns with financial prudence, stakeholder expectations, and evolving regulatory trends in today's business environment.

2.1.3 Measuring Sustainability: ESG Scores and Other Metrics

Corporate sustainability is commonly assessed using standardized measurement frameworks, with ESG scores standing out as among the most widely adopted. ESG, an acronym for Environmental, Social, and Governance, represents three dimensions used to evaluate a firm's sustainability-related performance (Del Vitto et al., 2023). An ESG score serves as a summary indicator that reflects how effectively a company manages its risks and externalities across these three domains.
The environmental dimension typically includes metrics such as greenhouse gas emissions, resource consumption, and waste management; the social dimension covers issues such as labour practices, community relations, and product safety; and the governance dimension focuses on board structure, ethical conduct, and transparency (Del Vitto et al., 2023).

These scores are generally produced by independent ESG rating agencies or data providers, which assess company disclosures, news sources, and other information to benchmark sustainability performance relative to industry peers. Leading providers include MSCI ESG Ratings, Sustainalytics, S&P Global (CSA/DJSI), and Refinitiv (formerly Thomson Reuters/Asset4), each applying distinct methodologies. For example, Refinitiv's ESG framework evaluates over 12,000 companies and assigns percentile-based scores ranging from 0 (lowest) to 100 (highest), based on industry-relative performance (Del Vitto et al., 2023). The expansion of ESG scoring systems reflects growing demand for quantifiable sustainability metrics, and they have become a key tool for investors to quickly evaluate a company's sustainability profile (Del Vitto et al., 2023).

It is important to recognize that ESG scores can vary substantially across rating providers due to differences in data sources, weighting schemes, and evaluation methodologies. Berg et al. (2022) documented significant divergence among the ESG scores assigned by six leading rating agencies, underscoring the lack of standardization in sustainability assessment practices. Nevertheless, ESG ratings remain widely used as a proxy for corporate sustainability performance in academic research and investment practice (D'Amato et al., 2021). While these composite scores offer a convenient summary measure, firms often supplement them with more granular sustainability metrics for internal tracking and disclosure purposes.
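To make the idea of percentile-based, industry-relative scoring concrete, the following Python sketch ranks firms against their industry peers on a 0–100 scale. The function name, the mid-rank formula, and the toy inputs are assumptions for exposition; actual providers such as LSEG layer proprietary indicator weighting and materiality logic on top of peer-relative ranks.

```python
from collections import defaultdict

def percentile_scores(value_by_firm, industry_by_firm):
    """Toy percentile-rank scoring (0-100) within industry peer groups.

    Illustrative only -- real rating providers apply proprietary
    weighting and materiality adjustments on top of raw ranks.
    """
    # Group firms into industry peer groups so ranking is peer-relative
    groups = defaultdict(list)
    for firm, industry in industry_by_firm.items():
        groups[industry].append(firm)

    scores = {}
    for firms in groups.values():
        values = [value_by_firm[f] for f in firms]
        n = len(values)
        for f in firms:
            v = value_by_firm[f]
            worse = sum(1 for x in values if x < v)
            same = sum(1 for x in values if x == v)  # includes the firm itself
            scores[f] = 100 * (worse + same / 2) / n  # mid-rank percentile
    return scores

# Four hypothetical "Tech" firms ranked on a single raw indicator
scores = percentile_scores(
    {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0},
    {"A": "Tech", "B": "Tech", "C": "Tech", "D": "Tech"},
)
```

With these toy inputs the four firms land at 12.5, 37.5, 62.5 and 87.5, i.e. evenly spread percentiles within the peer group; the key design point is that a score says how a firm ranks against its industry peers, not how sustainable it is in any absolute sense.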
For further information on the largest ESG report providers and their rating methods, Huber et al. (2017) have constructed a comprehensive summary table of the topic in their paper "ESG Reports and Ratings: What They Are, Why They Matter", which can also be found in this paper's Appendix 1.

In addition to ESG scores, several other frameworks and metrics are used to assess corporate sustainability. Many companies publish sustainability reports in accordance with sustainability regulation such as the Corporate Sustainability Reporting Directive (CSRD) and the accompanying European Sustainability Reporting Standards (ESRS), which follow established sustainability standards such as the Global Reporting Initiative (GRI) and mandate detailed qualitative and quantitative disclosures. Other benchmarks include sustainability indices, such as the Dow Jones Sustainability Index (DJSI) and FTSE4Good, which rank companies based on structured questionnaires and performance criteria. Organizations may also pursue third-party certifications or ratings, such as B Corp certification or Carbon Disclosure Project (CDP) scores, particularly for environmental performance.

Furthermore, concepts like Corporate Social Responsibility (CSR) and alignment with the UN Sustainable Development Goals (SDGs) are often used to qualitatively gauge a company's contributions to sustainable development. These diverse measurement approaches complement ESG ratings. For example, a company may receive a high ESG score from MSCI, be included in the DJSI, and disclose its sustainability efforts in line with GRI standards, together offering a more holistic view of corporate sustainability. In this thesis, however, the primary focus is on ESG scores as a quantifiable measure of sustainability performance, given their widespread use in financial markets and research.
2.2 Predictive Modelling of ESG Scores

In recent years, a growing body of research at the intersection of sustainable finance and machine learning has focused on predicting ESG scores using a variety of data sources. The motivation for this work is twofold: first, to identify the factors that influence ESG ratings, thereby offering insight into the rating process and the relationship between financial and sustainability performance; and second, to develop predictive models capable of estimating ESG scores where data are missing or of forecasting future ESG outcomes, with potential applications for investors and corporate decision-makers. Leveraging the increasing availability of ESG ratings and firm-level financial data, researchers have employed a wide range of machine learning (ML) methods, from linear regressions to advanced deep learning architectures, to model ESG scores. The following review surveys recent literature on ESG score prediction, organizing studies by model type, from linear and regularized models to ensemble and deep learning approaches.

2.2.1 Linear and Regularized Regression Models

Since sustainability has become more topical in recent years, many studies have tried to make ESG scoring more transparent by building predictive models. A study by Licari et al. (2021) used traditional linear regression to predict ESG scores across a large global dataset of more than 19,000 companies in 96 countries between 2004 and 2020. The paper found that traditional models struggle to handle the complexity and inconsistency of ESG score construction, highlighting the limitations of traditional statistical methods in modelling ESG ratings. In the paper, predicting ESG scores using linear regression achieved weak prediction performance, an R² of 31.13%, suggesting it captured only a small portion of the variation in ESG scores (31.13% of the variation in ESG scores was attributable to the independent variables in the model).
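Read literally, an R² of 31.13% means the model removes about 31% of the squared error that a constant mean-ESG prediction would leave. A minimal sketch of that reading, with all numbers invented for illustration:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE/SST: the share of squared error around the
    mean baseline that the model's predictions eliminate."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - sse / sst

# Hypothetical observed ESG scores and model predictions
observed = [42.0, 55.0, 61.0, 70.0, 48.0]
predicted = [45.0, 52.0, 66.0, 68.0, 50.0]

model_fit = r_squared(observed, predicted)
# Predicting the sample mean for every firm yields R^2 = 0 by construction
baseline_fit = r_squared(observed, [sum(observed) / 5] * 5)
```

The mean-prediction baseline is exactly the benchmark this thesis's hypotheses use: a useful model must reduce prediction variability well below what the constant mean prediction leaves.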
The paper presents multiple potential reasons for the poor performance of traditional models, including the complex nature of ESG rating methodologies, varying data sources, subjective weighting of ESG attributes, direct company engagement, coverage caps for smaller firms, and varying regulation between sectors and in emerging markets.

Del Vitto, Marazzina, and Stocco (2023) investigate the transparency of proprietary ESG ratings by attempting to replicate the ESG scoring methodology used by LSEG (formerly Refinitiv). Using a combination of machine learning methods, including regularized linear models (Ridge and Lasso regressions), Random Forest, and Artificial Neural Networks, they model the Environmental, Social, and Governance (ESG) pillar scores based on Refinitiv's full set of sustainability indicators and financial variables. A key contribution of their study is the demonstration that interpretable models such as Lasso and Ridge, often referred to as "white-box" methods, can achieve predictive performance comparable to more complex black-box models like neural networks. These linear models also offered the advantage of minimal overfitting and strong generalizability across sectors. The authors report high predictive accuracy for the Environmental pillar and moderate accuracy for the Social and Governance scores. The reduced accuracy for the social pillar is attributed to its broader and less quantifiable scope, while regional variation in Governance scores reflects differing institutional contexts and data availability, prompting caution when making cross-country comparisons (e.g., between the U.S. and China). Their analysis also reveals that feature importance varies across industries and geographies, underscoring the contextual nature of ESG rating mechanisms.
Overall, the findings suggest that a well-specified linear model using relevant financial and ESG indicators can approximate Refinitiv's ESG ratings with surprising accuracy.

In a study of Taiwanese companies, Lin and Hsu (2023) included a multiple linear regression as a benchmark for ESG score prediction. The authors emphasized the importance of establishing interpretable baseline models, particularly in the context of Taiwan's unique market characteristics, including a technology-driven economy, limited stock circulation, and heightened information asymmetry. Although they found that the linear models were consistently outperformed by more advanced machine learning techniques, the linear models still demonstrated moderate predictive accuracy and served as a transparent reference point for evaluating more complex approaches. The authors noted that linear models struggled to capture the nonlinear relationships inherent in ESG ratings, especially in the presence of multicollinearity among financial and governance-related variables. Nonetheless, the inclusion of linear regression highlighted the trade-off between model simplicity and predictive power, underscoring its value in contexts where interpretability and transparency are prioritized.

Notably, linear models allow researchers to identify which financial ratios and indicators have the most explanatory power for ESG scores, albeit under the assumption of a linear relationship. Commonly influential variables include profitability metrics, leverage, firm size, and industry-specific factors, which is consistent with broader empirical findings on the determinants of ESG performance. Regularization techniques such as Lasso regression further enhance model parsimony by shrinking the coefficients of less relevant predictors toward zero, thereby highlighting a core subset of explanatory features (Del Vitto et al., 2023).
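The shrinkage behaviour described above can be demonstrated with scikit-learn on synthetic data. This is a sketch under the assumption that scikit-learn is available, with a made-up data-generating process in which only two of eight "ratios" actually drive the toy ESG score:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic firm-year data: only the first two features matter;
# the remaining six are pure noise predictors.
n_obs, n_feat = 500, 8
X = rng.normal(size=(n_obs, n_feat))
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=2.0, size=n_obs)

# Standardise predictors so the L1 penalty treats them comparably
X_std = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X_std, y)
print(np.round(lasso.coef_, 2))  # coefficients of the noise features shrink to ~0
```

The penalty strength `alpha` governs the sparsity: larger values zero out more coefficients, leaving the core subset of explanatory features, which is exactly the parsimony argument made for Lasso above.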
While linear models generally exhibit lower predictive accuracy than nonlinear approaches in more complex environments, they nevertheless provide a transparent and reasonably effective baseline for ESG score modelling, particularly when interpretability and variable selection are of primary importance.

2.2.2 Tree-Based Ensemble Models (Random Forests and Boosting)

A significant portion of recent ESG prediction research employs tree-based ensemble models, including Random Forests (RF) and gradient boosting frameworks such as XGBoost and LightGBM. These models are suited to capturing nonlinear relationships and complex interactions among predictors, making them effective in financial modelling contexts. They have likewise shown strong performance in predicting ESG scores across various studies. As they can handle high-dimensional data and model heterogeneity, they have become a popular choice in studies aiming to replicate ESG scores or forecast sustainability performance. Moreover, ensemble models such as RF offer built-in mechanisms for estimating feature importance, which can provide insights into the relative contribution of predictors to ESG outcomes, albeit still with less transparency than linear models.

A Random Forest model was used by D'Amato, D'Ecclesia, and Levantesi (2021) in one of the pioneering works linking financial fundamentals with ESG ratings. Using data from 109 STOXX Europe 600 index companies during the 2010s, the authors trained the model on balance sheet and income statement ratios to predict Bloomberg's ESG disclosure scores. The study aimed to assess the predictive power of conventional financial variables in explaining variation in sustainability ratings. Among the models tested, Random Forest delivered the highest predictive performance, achieving an R2 of approximately 0.62, which outperformed linear regression and other baseline models. Key predictors identified included firm size, profitability, and leverage.
The authors concluded that financial statement items constitute a robust explanatory basis for ESG scores, providing empirical support for the notion that sustainability assessments, although seemingly non-financial in nature, are linked to a firm's financial characteristics.

Complementing the findings of D'Amato et al. (2021), Lin et al. (2019) had already found a negative link between corporate social responsibility and corporate financial performance measured by ROE, ROA and ROI, which supports the theory that a trade-off exists between optimising financial performance metrics and carrying out sustainability objectives. However, a few later studies have pointed out that the trade-off negatively affects only companies whose financial performance is below optimal to begin with.

D'Amato et al. (2022) expand on their earlier study, aiming to assess the effect of structural data and balance sheet items on the ESG scores of regularly traded stocks. In this study, they instead use Refinitiv (LSEG) ESG scores with a larger sample of companies across 2009–2019 and find that balance sheet items have significant predictive power for ESG scores. Based on their findings, the Random Forest algorithm performs best at predicting ESG scores compared to classical regression approaches, as it can capture the nonlinear relationships between ESG scores and predictive variables, which their study shows to occur consistently.

Cini and Ferrari (2025) took this approach a step further by introducing a time dimension: they trained an RF classification model to predict a firm's next-year ESG rating class using current financial ratios and risk indicators. Using panel data from 2016 to 2021 for European companies, their model categorized firms into ESG performance tiers (e.g., high, medium, or low) with high out-of-sample accuracy.
This is notable as it demonstrates forward-looking predictive power, essentially showing that there is informational content in financial fundamentals that anticipates improvements or declines in ESG performance. The authors described their model's accuracy as "unprecedented," suggesting practical applications in estimating ESG ratings for firms that lack current evaluations, such as small-cap or privately held companies.

Beyond Random Forests, boosting algorithms have also gained traction in ESG score prediction due to their high predictive accuracy and ability to model complex nonlinear relationships. Gradient boosting machines such as XGBoost have been applied in ESG studies with promising results. A study by Choi, Chen, and Lee (2024) compared multiple ML models on a dataset of Korean companies' financial ratios over three years, aiming to predict the companies' ESG ratings. They evaluated linear models, tree ensembles, and neural networks, and applied SHAP (Shapley Additive Explanations) to interpret variable importance. In their results, XGBoost was found to be the most effective model, achieving an F1-score of 85.1% in classifying ESG ratings.

Similarly, Lin and Hsu (2023) included XGBoost in their evaluation of ESG prediction models for Taiwanese firms and found it to perform competitively, although an alternative model, the Extreme Learning Machine (ELM), slightly outperformed it on their dataset. However, the literature also cautions that, particularly in ESG applications where datasets can consist of relatively small panels, boosting algorithms require careful hyperparameter tuning to prevent overfitting.

In summary, ensemble tree-based models have demonstrated strong predictive performance in ESG score modelling. By capturing nonlinear relationships and complex feature interactions, methods such as Random Forest and XGBoost often outperform linear regression models, which assume constant marginal effects.
For instance, the impact of profitability on ESG scores may vary nonlinearly, strengthening or diminishing beyond certain thresholds. The collective evidence from recent studies suggests that these models can effectively learn the functional mapping between financial ratios and ESG ratings, with reported R2 values and classification metrics substantially exceeding baseline accuracy (e.g., Choi et al., 2024; D'Amato et al., 2021). As a more recent example, Alsayyad and Fadel (2025) demonstrated high predictive performance in a comprehensive machine learning study on ESG scores using panel data, with the best R2 scores reaching over 0.9.

The findings generally indicate that a considerable portion of the variance in ESG ratings can be explained by financial data. However, it should be noted that each study's results depend on the specific dataset and ESG rating agency used, as each has a unique methodology. Additionally, several studies point to diminishing returns: once a robust tree-based model is in place, even more complex approaches may not dramatically improve accuracy, as we discuss next.

2.2.3 Deep Learning Models (Neural Networks)

Given the success of machine learning in predicting ESG scores, researchers have investigated deep learning approaches, such as multilayer artificial neural networks (ANNs), to see if they can further improve prediction performance. Neural networks can, in theory, capture very complex nonlinear interactions in data. However, in the context of ESG score prediction, deep learning has been explored less than tree-based models, and the empirical results are mixed.

Del Vitto et al. (2023) evaluated multiple ANN architectures in their effort to replicate Refinitiv's ESG scoring methodology. The authors tested both shallow and deep networks, varying the number of layers and hidden units, and benchmarked their performance against simpler models, including Lasso regression and Random Forest.
They found that increasing the depth and complexity of the neural networks did not consistently improve prediction accuracy. In some cases, a simpler ANN with fewer hidden layers performed comparably to, or better than, more complex architectures. Furthermore, the highest overall performance was achieved by the regularized linear models and the simpler ANN, rather than by the deeper ANNs or ensemble methods. These findings suggest that while ESG–financial relationships are nonlinear, they may not require deep architectures to model effectively. This could be due to the moderate size of structured ESG datasets and the risk of overfitting when models include too many parameters relative to the data available (Del Vitto et al., 2023).

Other studies reinforce the view that deep learning should be applied with caution in the context of ESG score prediction. Choi et al. (2024) included a neural network in their model comparison when classifying ESG ratings for Korean firms but ultimately found the tree-based XGBoost model superior. Lin and Hsu (2023) studied ESG score prediction for Taiwanese non-financial companies using 27 financial metrics together with corporate governance indicators. They used an Extreme Learning Machine (ELM), a form of single-layer neural network with random weights, and reported that the ELM achieved excellent performance (R2 of over 0.9 for multiple models), slightly outperforming Random Forest and XGBoost in predicting ESG scores on their dataset. While the ELM is technically a neural approach, it is not a deep learning model; rather, it offers an efficient architecture for capturing nonlinearities in relatively small datasets.

These findings suggest that neural models, especially shallow or lightweight variants like the ELM, can perform competitively in ESG prediction. However, evidence from recent studies indicates that deep neural networks have not consistently outperformed boosting or ensemble tree methods when using structured financial data alone.
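The appeal of a shallow network on tabular data can be illustrated with a minimal sketch. The example below is an assumption-laden toy comparison, not the setup of any cited study: a single-hidden-layer network is fit alongside Ridge regression on synthetic ratio-style features containing a mild nonlinearity that the linear model cannot capture.

```python
# Illustrative sketch: shallow neural network vs. a linear baseline on
# synthetic tabular data (feature names and data-generating process assumed).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))                 # hypothetical financial ratios
# Linear effect of the first feature plus a quadratic effect of the second
y = 50 + 8 * X[:, 0] - 4 * X[:, 1] ** 2 + rng.normal(scale=2, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)

mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=3000, random_state=0)
mlp.fit(scaler.transform(X_tr), y_tr)
ridge = Ridge().fit(scaler.transform(X_tr), y_tr)

print("shallow MLP R2:", round(mlp.score(scaler.transform(X_te), y_te), 3))
print("ridge R2:", round(ridge.score(scaler.transform(X_te), y_te), 3))
```

On data like this, one hidden layer already captures the quadratic term, mirroring the finding that added depth is not necessarily needed for structured ESG-style datasets.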
Although hybrid deep learning approaches incorporating unstructured data such as ESG reports or news sentiment are gaining attention, they fall outside the scope of predictions based on structured data and remain a rather new topic in the research. As such, the incremental benefit of deep learning over more interpretable machine learning models remains limited in this domain, particularly given concerns around overfitting, data volume, and model transparency. An advantage of neural networks is their flexibility in integrating heterogeneous data sources, such as combining structured financial indicators with unstructured information like textual disclosures or ESG news. However, this flexibility comes at the cost of reduced model interpretability, which presents a limitation for sustainability assessment. To address this concern, recent studies have increasingly employed explainable AI (XAI) techniques to interpret the internal logic of complex models. For example, both Del Vitto et al. (2023) and Choi et al. (2024) applied SHAP (Shapley Additive Explanations) to their ESG prediction models, enabling them to identify which input features, such as the debt-to-equity ratio, return on assets, or carbon emissions, had the greatest influence on predicted ESG scores.

A related line of research explores the integration of natural language processing (NLP) and model robustness techniques. For example, Lee et al. (2022) proposed an AI framework for predicting firm-specific ESG ratings by analysing governance and social-related datasets using a combination of machine learning and NLP algorithms. In addition to evaluating multiple models for prediction accuracy, their study addressed the vulnerability of ESG systems to adversarial attacks, which they describe as malicious manipulations of input data that can distort rating outcomes.
They introduced a method for detecting such attacks, contributing to the growing emphasis on data reliability and security in ESG analytics. While such hybrid approaches extend beyond structured financial data and remain relatively novel, they showcase how new AI technology can be applied to expand the scope and resilience of ESG prediction models.

The use of XAI contributes to making opaque models more transparent, which is important in the ESG domain, where stakeholders seek to understand the drivers behind sustainability ratings. Insights from these interpretability tools also reinforce broader findings in the literature: across both complex and simpler models, a relatively consistent set of financial variables frequently emerges as key predictors of ESG performance. Profitability, firm size, leverage, and industry-specific environmental or social factors are among the most cited drivers, hinting that certain financial fundamentals hold robust explanatory power across models and contexts.

In summary, while model performance varies, recent literature confirms that ESG scores can be predicted with reasonable accuracy using firm-level financial data and machine learning techniques. The table below summarizes key studies from this literature review, highlighting their methods, data sources, and key findings.

Table 1. Summary of key studies.

Study: D'Amato et al. (2021)
Data: Euro Stoxx 600 firms, Bloomberg ESG scores
Methods: Random Forest vs. linear models
Key findings: RF achieved R2 ~0.62, indicating financial metrics explain ESG scores
Additional insights: Highlighted the importance of structural financial data in ESG ratings.

Study: Del Vitto et al. (2023)
Data: Refinitiv ESG scores, global firms by sector
Methods: Lasso, Ridge, Decision Tree, Random Forest, ANN
Key findings: Lasso and a shallow ANN were the best predictors of ESG; deeper ANNs did not improve much
Additional insights: ESG ratings can be largely replicated with a selected feature set and ML models.

Study: Lin & Hsu (2023)
Data: Taiwanese companies, ESG index scores (2018–2021)
Methods: SVM, Random Forest, XGBoost, Extreme Learning Machine (ELM)
Key findings: High accuracy (R2 of ~0.9+) with different models
Additional insights: Integrating financial and governance indicators is effective.

Study: Chowdhury et al. (2023)
Data: 6171 firms from 2005 to 2019
Methods: Six machine learning classification models
Key findings: The RFC model was superior with 78.50% accuracy
Additional insights: Findings highlight the relationship between firm size, liquidity, and ESG investing.

Study: Choi et al. (2024)
Data: Korean firms, ESG ratings from a local agency
Methods: Multiple (linear, RF, XGBoost, deep NN) + XAI
Key findings: XGBoost was best (F1 ~85%), beating deep neural nets
Additional insights: Financial factors (leverage, profitability) were significant.

Study: Cini & Ferrari (2025)
Data: Euro Stoxx 600 (2016–2021), MSCI (or similar) ESG rating classes
Methods: Random Forest classification to predict next-year ESG rating class from current financial ratios + a systemic risk metric
Key findings: High out-of-sample accuracy in classifying ESG ratings one year ahead
Additional insights: Investors can forecast sustainability improvements or deteriorations using financial data.

3 Data and Data Processing

Data for this study is gathered from the LSEG database. LSEG is one of the largest and most important financial data and ESG score providers, covering more than 80% of global market capitalization (LSEG.com). LSEG ESG ratings are percentile rank scores, ranging from 0 (lowest) to 100 (highest). These ratings aim to objectively assess a company's relative ESG performance. LSEG states that their ESG ratings are data-driven, consider the most crucial industry metrics, and are adjusted for biases related to transparency and market capitalization.
However, their scores are not exempt from some of the main problems with ESG scores.

The original dataset consists of six hundred European companies from the STOXX Europe 600 index for the years 2014 to 2023. The original variables are selected based on suggestions and previous findings within the literature, following Chowdhury et al. (2023), D'Amato et al. (2021) and D'Amato et al. (2022). D'Amato et al. (2021) convincingly argue that using ratios reflecting the overall financial statements of the companies, representing profitability, liquidity and solvency, is more informative and improves the characterization of companies than using absolute financial statement values when aiming to explain ESG scores. For this reason, this paper skips the step of testing feature importances using raw financial statement values and uses financial statement ratios instead.

After testing for variable correlations and initial model performance, some potentially influential variables originally recommended in the literature, such as NI/NS (Net Income / Net Sales), were removed from the final dataset and model, leaving EBIT/NS as the main proxy for profitability, since it had proportionally higher prediction power in the models while having over 96% correlation with NI/NS.

After omitting variables based on initial model performance and variable correlations, the following variables are used in the models:

• YEAR: 2014–2023.
• INDUSTRY: General industry sector classification variable (range: 1–6). The classes are transformed into dummy variables in the model.
• ESGScore: ESG score from the LSEG database.
• ESGE: Environmental score from LSEG.
• ESGS: Social score from LSEG.
• ESGG: Governance score from LSEG.
• ASSETS: Total assets of the company.
• SIZE: Logarithm of assets.
• NS/TA: Net Sales divided by Total Assets.
  o Efficiency ratio (turnover).
  o Measures how efficiently a company uses its assets to generate sales.
  o Indicates operational efficiency and asset utilization; a higher ratio indicates more effective use of assets to drive revenue.
• EBIT/NS: The ratio of Earnings Before Interest and Taxes to Net Sales.
  o Profitability ratio (operating margin).
  o The proportion of sales remaining as operating profit before accounting for interest and taxes.
  o Compares operational performance across companies regardless of their financing and tax structures.
• DIV: Dividend yield.
  o Income (yield) ratio.
  o The annual dividend per share relative to the stock price.
  o Indicates the cash return on investment and can reflect the company's commitment to returning profits to shareholders.
• P/E: Price to Earnings ratio.
  o Valuation ratio.
  o The market value of a stock relative to its earnings per share.
  o Provides insights into market expectations and relative valuation.
• CA/CL: Current Assets to Current Liabilities.
  o Liquidity ratio (working capital).
  o The ability of a company to cover its short-term liabilities with its short-term assets.
  o Serves as an indicator of short-term financial health and liquidity.
• TD/TA: Total Debt to Total Assets.
  o Solvency (leverage) ratio.
  o The proportion of a company's assets financed by debt.
  o Evaluates financial risk and solvency; a lower ratio typically implies a more conservative capital structure.
• ROE: Return on Equity.
  o Profitability ratio.
  o Profitability relative to shareholders' equity, reflecting how effectively a company uses equity capital to generate profits.
  o Reflects management effectiveness and overall profitability relative to equity; critical for comparing performance among companies in the same industry.
• ROA: Return on Assets.
  o Profitability ratio (with an efficiency component).
  o How effectively a company generates profit from its total asset base.
  o Gives a view of operational efficiency and profitability, valuable for comparing companies irrespective of their financing structures.
• P/E missing flag.
  o Categorical variable included in the model, indicating observations where P/E was missing.
• CA/CL missing flag.
  o Categorical variable included in the model, indicating observations where CA/CL was missing.

The processed data used in the models consists of the ESG score as the main outcome variable, with the Environmental, Social and Governance scores separately as additional dependent variables. Independent variables consist of the general industry classification from 1 to 6, Year, Total Assets, Dividend Yield, ROE, ROA, categorical flags for missing P/E and CA/CL ratios, and the financial ratios Net Sales / Total Assets, EBIT / Total Assets, EBIT / Net Sales, Price / Earnings, Current Assets / Current Liabilities and Total Debt / Total Assets.

Similarly to D'Amato et al. (2021), "Year" is included in this paper as a separate static variable, disregarding year-on-year changes. The model considers the years 2014 to 2023, although for 2023 most of the ESG scores were missing at the time the dataset was obtained.

Chowdhury et al. (2023) argue, based on variable importance factors, that the lagged ESG score is the most important predictor of ESG, followed by firm size and the debt-to-equity ratio, indicating that previous investments into ESG, firms' total assets and financial leverage are the best predictors of the ESG score. In the testing and model optimisation process of this paper, the lagged ESG score was also found to be the most important predictor of ESG scores based on variable importances. However, it can be reasoned that this finding is rather obvious, and the variable alone could explain most of the ESG score in the model testing phase, dominating the model and results.
As the objective of this paper is to predict ESG scores from financial statement items and reveal information about the underlying influence of financial statement items on ESG scores, including a lagged value of the dependent variable itself as an independent variable works to defeat this purpose, and it is therefore omitted from the models.

3.1 Data Integrity and Initial Screening

The dataset has a panel-like structure, containing annual observations for STOXX Europe 600 companies from 2014 to 2023, but is analysed cross-sectionally. The dataset contains no duplicate values. As the first step of data processing, observations with missing ESG scores were omitted to ensure consistency of the dependent variables. The remaining dataset contains 4937 valid firm-year observations and is described in the following tables in this chapter.

3.2 Currency Standardization

The financial data in the original dataset varied in currency, showing each firm's financial information in its local currency; the data contains observations in eight different currencies: United Kingdom Pound (GBP), Euro (EUR), Danish Krone (DKK), Swiss Franc (CHF), Hong Kong Dollar (HKD), Norwegian Krone (NOK), Swedish Krona (SEK) and Polish Zloty (PLN). Since most of the variables in the model are ratios, currency differences between firms only affected total assets. Assets were converted to EUR using the average annual EUR exchange rate corresponding to each observation's year to ensure comparability across firms and time.

3.3 Descriptive Statistics of the Processed Dataset

Table 2. Summary statistics.
Variable | count | mean | std | min | 25% | 50% | 75% | max
ESG Score | 4937 | 65.37 | 17.28 | 2.60 | 54.96 | 68.55 | 78.45 | 95.72
ESG E | 4937 | 64.43 | 23.44 | 0.00 | 49.44 | 69.69 | 83.20 | 99.14
ESG S | 4937 | 68.84 | 19.87 | 0.25 | 56.83 | 73.43 | 84.41 | 98.20
ESG G | 4937 | 61.52 | 20.92 | 1.45 | 46.91 | 64.95 | 78.42 | 98.56
ASSETS | 4935 | 85375584 | 342680815 | 36571 | 3945650 | 10395392 | 40324822 | 6639198547
EBIT/NS | 4901 | 0.23 | 1.02 | -28.28 | 0.07 | 0.13 | 0.23 | 29.13
TD/TA | 4935 | 0.25 | 0.16 | 0.00 | 0.13 | 0.24 | 0.35 | 1.32
NS/TA | 4934 | 0.67 | 0.58 | -0.05 | 0.23 | 0.58 | 0.90 | 4.41
CA/CL | 3807 | 1.60 | 1.32 | 0.21 | 0.99 | 1.31 | 1.80 | 29.27
DIV | 4916 | 2.72 | 2.67 | 0.00 | 1.18 | 2.35 | 3.92 | 111.23
P/E | 4562 | 38.76 | 206.03 | 0.20 | 12.50 | 19.00 | 28.20 | 8105.00
ROE | 4887 | 17.53 | 65.76 | -262.32 | 7.46 | 13.24 | 20.91 | 2409.86
ROA | 4870 | 6.89 | 12.35 | -63.72 | 2.05 | 5.38 | 9.18 | 269.11

The descriptive statistics of all variables are presented in Table 2, summarizing the central tendency and dispersion of ESG scores and the associated financial statement items.

3.3.1 Overview of ESG Scores

The mean ESG score of the sample is approximately 65.4 with a standard deviation of 17.3, indicating that most companies cluster around mid-to-high ESG performance levels. The environmental (E), social (S), and governance (G) pillars show comparable patterns: the S score has the highest mean (68.8) and slightly lower dispersion, suggesting more consistent social-performance evaluations across firms, while the G score shows the greatest variability (SD ≈ 20.9), potentially indicating broader differences in corporate governance practices across Europe. The overall range of scores (minimum ≈ 2.6, maximum ≈ 95.7) shows that the dataset contains both low- and high-performing firms, offering variation for predictive modelling.

The time-series analysis presented later in Table 3 (Section 3.4) reveals an upward trend in ESG means and a gradual reduction in standard deviations over the period 2014–2023.
This pattern is consistent with the increasing institutional emphasis on sustainability reporting and improved data coverage in Europe, which have led to more homogeneous ESG assessments in recent years.

3.3.2 Overview of Financial Variables

The financial ratios display considerable heterogeneity, which is expected given the cross-industry composition of the sample. Total assets (ASSETS) vary widely, spanning several orders of magnitude and reflecting the coexistence of smaller firms and multinational corporations. Ratios such as EBIT/NS (mean = 0.23, SD = 1.02) and ROE (mean = 17.5, SD = 65.8) show substantial dispersion, partly driven by outliers in profitability and capital structure. This variation underscores the need for robust algorithms and outlier treatment during modelling.

Leverage-related ratios, such as Total Debt to Total Assets (TD/TA), display relatively moderate variation (mean = 0.25, SD = 0.16), suggesting that debt levels among listed European firms are somewhat stable across industries. In contrast, CA/CL exhibits wide variation (mean = 1.6, SD = 1.3) where observed, reflecting differences in liquidity structures, especially between manufacturing firms and financial institutions.

The P/E ratio demonstrates extreme spread (mean ≈ 38.8, SD ≈ 206.0, max > 8,000), indicating the presence of a few extraordinarily high values. Such dispersion arises from low or near-zero earnings denominators, illustrating one reason for winsorising. Missing or undefined P/E values need to be handled carefully, which will be discussed further in the missing-data assessment and the subsequent imputation chapter.

3.3.3 Interpretation and Relevance for Modelling

The descriptive analysis highlights several considerations for modelling. The dataset is sufficiently diverse, with a range of firm sizes, profitability levels and sustainability outcomes, for the aim of finding relationships between financial statement items and ESG scores.
The magnitudes and dispersion of the variables indicate a need for data processing choices, such as winsorisation of extreme values, standardization for linear algorithms, and context-specific handling of missing data.

3.4 Distribution Across Time

Table 3 presents the mean and standard deviation of the ESG score and its subcomponents for the years 2014 to 2023. The results show an upward trend in average scores over the period, accompanied by a gradual reduction in dispersion. The mean overall ESG score rises from approximately 58 in 2014 to about 70 in 2021–2022, while standard deviations decline from around 19 to 14. Similar dynamics are observed across the three pillars, although the magnitude and pace of change vary slightly: the Environmental (E) and Social (S) dimensions increase more steadily than Governance (G), which remains comparatively volatile.

Table 3. Main statistics of the ESG, E, S and G score distributions by year for the sample of 600 companies listed in the STOXX Europe 600 Index.

Year | ESG Mean | ESG SD | E Mean | E SD | S Mean | S SD | G Mean | G SD
2014 | 58.52 | 19.46 | 61.24 | 25.34 | 60.34 | 23.11 | 54.76 | 22.09
2015 | 60.28 | 19.57 | 62.19 | 25.19 | 63.52 | 22.66 | 55.48 | 22.32
2016 | 61.90 | 18.23 | 63.74 | 23.92 | 65.85 | 21.57 | 56.29 | 21.80
2017 | 63.37 | 17.55 | 63.48 | 24.28 | 68.69 | 19.90 | 57.08 | 21.62
2018 | 64.79 | 17.53 | 60.82 | 25.47 | 69.26 | 19.49 | 61.08 | 21.05
2019 | 66.86 | 16.23 | 64.25 | 23.52 | 70.58 | 18.73 | 63.42 | 19.71
2020 | 69.30 | 15.33 | 66.00 | 22.28 | 72.22 | 17.46 | 67.43 | 18.66
2021 | 70.03 | 14.44 | 67.74 | 20.92 | 73.00 | 16.64 | 67.42 | 18.26
2022 | 70.02 | 13.98 | 68.86 | 19.67 | 73.00 | 16.67 | 66.39 | 18.38
2023 | 67.17 | 13.39 | 65.46 | 20.27 | 68.51 | 16.50 | 66.21 | 18.48

The pattern indicates that firms in the STOXX Europe 600 have generally improved their reported sustainability performance during the past decade. The concurrent decline in standard deviations suggests convergence among firms, meaning that extreme low performers have become less frequent while mid-range and high-range scores have become more typical.
This aligns with general observations within the sustainability literature on the increasing importance of sustainability, especially in Europe. According to LSEG, although its ESG scores incorporate country-specific characteristics, they remain broadly comparable to ESG scores on other continents, and Europe is a pioneering market in sustainability relative to other markets.

Over the years, ESG reporting has benefited from enhanced disclosure requirements and more consistent evaluation frameworks (LSEG), which may also reflect gradual improvement in the underlying data infrastructure. The increase in mean scores and the narrowing of their spread can both result from a combination of progress in corporate sustainability and a methodological maturation of ESG measurement. The pattern suggests that the dataset captures broader systemic changes in European sustainability reporting and corporate sustainability in addition to firm-level differences.

3.5 Industry Classification

To capture cross-sectoral patterns in sustainability performance, the dataset is divided into six broad industry groups based on the general business classification used throughout this thesis. The classifications and their definitions are presented in Table 4.

Table 4. General Industry Classification explanation.

1. Various Industries: A "catch-all" category for companies operating across different sectors, containing a diverging range of industries, including personal goods, pharmaceuticals and biotechnology, technology hardware and equipment, food producers, retail, oil, gas, coal, aerospace, etc.
2. Electricity and Telecommunications: Companies involved in providing electricity, telecommunications, and similar utilities.
3. Transportation, Travel, Leisure: Companies involved in transportation services, travel, leisure, and related industries such as tourism.
4. Banks: Banks and financial institutions engaged in commercial banking activities.
5. Insurance: Companies operating in the insurance industry.
6. Real Estate and Investments: Companies involved in real estate, investment management, investment banking, etc.

The mean and standard deviation of the ESG, E, S, and G scores by industry are summarized in Table 5 below.

Table 5. Main statistics of the ESG, E, S, and G score distributions by industry sector for the sample of 600 companies listed in the STOXX Europe 600 Index.

Sector | ESG Mean | ESG SD | E Mean | E SD | S Mean | S SD | G Mean | G SD
1 | 65.85 | 17.02 | 63.25 | 23.06 | 70.37 | 19.69 | 61.16 | 21.04
2 | 69.28 | 14.72 | 69.64 | 19.65 | 70.98 | 19.02 | 64.63 | 17.14
3 | 63.53 | 15.17 | 64.44 | 19.49 | 67.49 | 16.88 | 58.58 | 22.10
4 | 66.54 | 17.26 | 73.92 | 22.02 | 69.41 | 18.12 | 62.02 | 21.16
5 | 64.44 | 16.05 | 65.18 | 23.51 | 62.08 | 18.42 | 69.32 | 18.50
6 | 58.04 | 20.44 | 58.80 | 28.06 | 59.44 | 21.68 | 56.30 | 22.34

Across industries, ESG performance varies notably in both mean values and dispersion. Firms in Electricity and Telecommunications (Industry 2) show the highest average ESG and E scores, consistent with the sector's strong exposure to environmental regulation and renewable energy transition policies. Conversely, Real Estate and Investment firms (Industry 6) record the lowest overall ESG averages and the widest dispersion, reflecting structural heterogeneity and differing reporting standards within that category.

Banks (Industry 4) exhibit relatively high Environmental (E) scores compared to other sectors, which may stem from their lower direct emissions and increasing engagement in sustainable finance. In contrast, Insurance firms (Industry 5) tend to perform better in Governance (G), likely due to regulatory oversight and mature compliance systems. These cross-industry contrasts confirm that sustainability outcomes are influenced not only by firm-level financial characteristics but also by sector-specific operational and regulatory contexts.
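As noted in the variable list, the categorical industry code enters the models as dummy variables. A minimal sketch of that encoding follows; the column names and toy values are illustrative, not the thesis's actual data.

```python
# Minimal sketch of dummy-encoding the INDUSTRY classification (1-6)
# before modelling; values are illustrative.
import pandas as pd

df = pd.DataFrame({
    "INDUSTRY": [1, 2, 4, 6, 2],
    "ESGScore": [65.9, 69.3, 66.5, 58.0, 70.1],
})
# drop_first=True drops one level as the baseline category, avoiding
# perfect collinearity in linear models (tree models would not need this).
dummies = pd.get_dummies(df["INDUSTRY"], prefix="IND", drop_first=True)
df = pd.concat([df.drop(columns="INDUSTRY"), dummies], axis=1)
print(df.columns.tolist())
```

For tree-based models the raw integer code could in principle be used directly, but dummy encoding keeps the feature set consistent across the linear and nonlinear models compared in this study.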
From a modelling perspective, such heterogeneity underscores the importance of including industry identifiers or fixed effects when predicting ESG outcomes. Industry-level variation can capture systematic differences in disclosure norms, business models, and risk exposures. The implications of industry structure for feature importance and model behaviour will be revisited in later chapters, where variable importance and interpretability methods (e.g., SHAP values and partial dependence plots) are discussed.

Figure 1. Mean ESG Score by Year and Industry.

Figure 2. Standard deviation of ESG Score by Year and Industry.

3.6 Correlogram

To examine the relationships among variables and assess potential redundancy among predictors, pairwise correlations were computed using Pearson's correlation coefficient (r). Pearson's r quantifies the linear association between two continuous variables, ranging from –1 to +1, where values near ±1 indicate a strong linear relationship and values close to zero imply weak or no correlation. Correlation analysis provides a diagnostic step for identifying potential multicollinearity, which can affect model stability and interpretability, particularly for linear estimators.

It should be noted that the variables included in this correlation matrix represent the final selected features after preliminary testing. During the earlier data preparation phase, variables showing near-perfect linear dependence and limited marginal contribution were excluded to reduce redundancy. For example, NI/NS (Net Income / Net Sales) was removed due to its very high correlation (r ≈ 0.96) with EBIT/NS, while the latter was retained as a more informative profitability measure with higher predictive relevance in preliminary model evaluations. Consequently, the correlogram presented in Figure 3 visualizes correlations among the refined set of predictors that were ultimately used in model training.
To analyse the correlations among the variables in the dataset, the variables are plotted in a correlogram, shown in Figure 3. Positive correlations are illustrated in red and negative correlations in blue; colour intensity is proportional to the correlation coefficient.

Figure 3. Correlogram.

The strongest positive associations appear between Return on Equity (ROE) and Return on Assets (ROA), both profitability-based measures driven by net income performance. Similarly, EBIT/NS exhibits a strong positive correlation with ROE, indicating that firms with higher operating profitability typically achieve higher returns on equity. Moderate negative correlations emerge between leverage and profitability measures, such as between Total Debt to Total Assets (TD/TA) and ROA, suggesting that higher leverage is, on average, associated with lower returns. Liquidity ratios such as CA/CL show weaker or more heterogeneous relationships with profitability, indicating that short-term solvency conditions vary independently of performance and sustainability factors across sectors.

Overall, the correlation results indicate moderate interdependencies but no critical multicollinearity among the retained predictors. Tree-based models such as Random Forest and XGBoost, which form the primary modelling techniques in this study, are robust to the remaining correlations due to their hierarchical structure. For linear models such as Ridge and Lasso regression, regularization further mitigates residual multicollinearity, as discussed later.

From a broader perspective, the observed positive associations among size, profitability ratios, and ESG outcomes suggest that larger and more profitable firms can achieve higher sustainability ratings. While this does not imply causality, it hints that financially healthy firms may possess greater resources and incentives for sustainability practices.
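The redundancy screen described above can be sketched in a few lines. This is a minimal illustration, not the thesis code: the DataFrame below uses synthetic data standing in for the firm-year ratios, and the 0.95 cutoff is chosen to mirror the NI/NS exclusion (r ≈ 0.96 with EBIT/NS) mentioned earlier.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the firm-year ratio data used in the thesis
rng = np.random.default_rng(0)
n = 500
roa = rng.normal(0.05, 0.03, n)
df = pd.DataFrame({
    "ROA": roa,
    "ROE": roa * 2 + rng.normal(0, 0.02, n),   # strongly correlated with ROA
    "TD/TA": rng.uniform(0.1, 0.7, n),
    "CA/CL": rng.uniform(0.5, 3.0, n),
})

# Pairwise Pearson correlations among predictors (the basis of the correlogram)
corr = df.corr(method="pearson")

# Screen for near-perfect linear dependence, e.g. |r| > 0.95
high = (corr.abs() > 0.95) & (corr.abs() < 1.0)
print(corr.round(2))
print("Highly collinear pairs present:", bool(high.values.any()))
```

In the thesis workflow the resulting matrix would then be visualized as a colour-coded correlogram; the screen on |r| motivates dropping one member of each near-duplicate pair before training.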
3.7 Density Functions of ESG Scores

Figure 4 displays kernel density estimates (KDE) for the overall ESG score and its E, S, and G subcomponents. The dataset contains 4,937 firm-year observations. KDE is a non-parametric method for estimating the probability density function of a random variable, providing a representation of how values are distributed across the sample (Silverman, 1986).

Figure 4. Density functions of ESG score, E, S and G.

Table 6. Density functions of ESG score, E, S and G.

            | ESG score | ESG E   | ESG S   | ESG G
Sample size | 4937      | 4937    | 4937    | 4937
Bandwidth   | 3.15344   | 4.27896 | 3.62623 | 3.8181

The bandwidth determines how much the curve is smoothed: a smaller bandwidth produces a curve that follows local fluctuations more closely, while a larger bandwidth smooths over wider ranges, emphasizing the general shape of the distribution but potentially hiding smaller peaks or irregularities.

Bandwidths are determined using Silverman's rule of thumb, resulting in values between approximately 3.1 and 4.3 for the ESG variables. In practice, these values mean that the ESG score distribution is smoothed over windows of about three units on the 0–100 scale, while ESG E is smoothed slightly more broadly.

The density curves show that ESG scores are not uniformly distributed. Most observations cluster in the 60–85 range, while relatively few observations lie at the extremes. The E and G distributions appear broader and more dispersed, while S has a slightly sharper peak at high values. The density functions matter because they highlight both the central tendency and the dispersion of ESG scores in the dataset.

3.8 Missing Data Assessment and Handling

It is important to examine the extent and nature of missing data before modelling, as missingness can affect both the reliability and interpretability of results. In the dataset, missing values occur across several variables with differing magnitudes and underlying causes.
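The KDE and bandwidth rule above can be made concrete. The sketch below implements Silverman's rule of thumb, h = 0.9 · min(sd, IQR/1.34) · n^(−1/5), and a plain Gaussian KDE from scratch; the score sample is synthetic (a clipped normal standing in for the 4,937 ESG observations), so the bandwidth it produces will only be in the same general range as those in Table 6, not identical.

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule of thumb: h = 0.9 * min(sd, IQR/1.34) * n**(-1/5)."""
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    return 0.9 * min(sd, iqr / 1.34) * x.size ** (-0.2)

def gaussian_kde(x, grid, h):
    """Evaluate a Gaussian-kernel density estimate on `grid` with bandwidth h."""
    x = np.asarray(x, dtype=float)
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (x.size * h * np.sqrt(2 * np.pi))

# Synthetic scores standing in for the ESG sample (n = 4,937, 0-100 scale)
rng = np.random.default_rng(1)
scores = np.clip(rng.normal(70, 15, 4937), 0, 100)
h = silverman_bandwidth(scores)
grid = np.linspace(0, 100, 201)
density = gaussian_kde(scores, grid, h)
```

Plotting `density` against `grid` for each pillar would reproduce the shape of Figure 4; the bandwidth h is the smoothing window discussed above.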
Table 7 summarizes the number of missing observations for each variable by industry classification. The missingness pattern is unevenly distributed, both in terms of variables and industries, suggesting that values are not missing completely at random.

This section focuses on describing the patterns and implications of missing values within the dataset, while the specific imputation procedures and justifications are presented separately in Section 3.9, Pre-Modelling Procedures and Pipeline Design.

Table 7. Missing values per variable by industry.

Variable  | Ind. 1 | Ind. 2 | Ind. 3 | Ind. 4 | Ind. 5 | Ind. 6
ESG score | 0      | 0      | 0      | 0      | 0      | 0
ESG E     | 0      | 0      | 0      | 0      | 0      | 0
ESG S     | 0      | 0      | 0      | 0      | 0      | 0
ESG G     | 0      | 0      | 0      | 0      | 0      | 0
ASSETS    | 2      | 0      | 0      | 0      | 0      | 0
EBIT/TA   | 14     | 0      | 0      | 14     | 6      | 0
EBIT/NS   | 15     | 0      | 0      | 14     | 6      | 1
TD/TA     | 2      | 0      | 0      | 0      | 0      | 0
NS/TA     | 3      | 0      | 0      | 0      | 0      | 0
CA/CL     | 2      | 0      | 0      | 424    | 293    | 411
DIV Y     | 18     | 1      | 0      | 1      | 0      | 1
P/E       | 251    | 32     | 9      | 46     | 15     | 22
ROE       | 46     | 2      | 0      | 1      | 0      | 1
ROA       | 10     | 1      | 0      | 55     | 0      | 1

3.8.1 Extent and Distribution of Missingness

Table 7 reports the number of missing observations for each variable across the six industry classifications. The pattern is uneven and concentrated in specific variables and sectors. The Current Assets to Current Liabilities (CA/CL) ratio shows the highest level of missingness, with approximately one quarter of observations absent overall. The absence is concentrated in the financial and real estate sectors, industries 4 (Banks), 5 (Insurance), and 6 (Real Estate and Investments), where in several cases all CA/CL values are missing.

Another variable affected by substantial missingness is the Price-to-Earnings (P/E) ratio. Inspecting the data reveals that most missing P/E entries coincide with negative Earnings Before Interest and Taxes (EBIT), making the ratio undefined rather than simply unreported. Missingness is therefore embedded in the accounting structure of the variable.
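A per-industry missingness summary like Table 7 is a one-liner in pandas. The frame below is a toy stand-in (column and industry labels mimic the thesis data; the values are illustrative only), showing the pattern where CA/CL is entirely absent for financial-sector firms while other gaps are sporadic.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the structure of the dataset: CA/CL missing for
# financial-sector firms (industries 4 and 6), other gaps sporadic.
df = pd.DataFrame({
    "industry": [1, 1, 2, 4, 4, 6],
    "CA/CL":    [1.8, np.nan, 2.1, np.nan, np.nan, np.nan],
    "ROE":      [0.12, 0.08, np.nan, 0.15, 0.09, 0.11],
})

# Count of missing values per variable within each industry,
# the same summary reported in Table 7
missing_by_industry = df.drop(columns="industry").isna().groupby(df["industry"]).sum()
print(missing_by_industry)
```

Inspecting such a table is what reveals that CA/CL missingness is industry-driven (a MAR-type mechanism), which motivates the treatment described in the following subsections.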
For the remaining financial ratios, missing values are comparatively rare and irregular, suggesting minor gaps in firm-level reporting rather than systematic omissions.

3.8.2 Mechanisms of Missing Data

These patterns imply that missingness is not Missing Completely at Random (MCAR), where the probability of missingness is independent of any variable in the dataset. Rather, the missingness mechanisms align with Missing at Random (MAR) or Missing Not at Random (MNAR).

The CA/CL variable follows an industry-dependent pattern consistent with MAR, since the likelihood of missing values depends on a categorical factor (industry classification) observable in the data. In contrast, the P/E variable is closer to MNAR, because missingness is systematically related to the unobserved (negative) earnings values that make P/E undefined. These mechanisms imply that missingness has economic meaning which needs to be considered.

3.8.3 Implications for Data Integrity and Modelling

Recognizing the origins of missingness has implications for how these gaps should be treated. Excluding all observations with missing values would remove entire industries and firms with negative earnings from the analysis, introducing selection bias and reducing the sample size, while simple mean imputation would ignore economically meaningful differences between industries or profitability levels. Therefore, a more context-sensitive approach is needed, one that allows the dataset to remain as complete as possible while preserving the interpretability of model relationships.

3.8.4 Treatment Principles

In this thesis, for models that need complete data, missing data are handled using variable-specific strategies that depend on the structure and origin of the missingness. For variables such as CA/CL and P/E, missingness itself carries interpretive value, indicating the presence of distinctive financial structures or earnings conditions.
The chosen methods aim to retain their informational content while ensuring that models requiring complete inputs can still be trained. The details of these procedures, including the use of sentinel values paired with missingness indicators, the rationale for applying them only to selected industries, and how they are implemented separately within training and test partitions, are presented in Chapter 3.9, Pre-Modelling Procedures and Pipeline Design. For variables with minor or unsystematic missingness, such as ROE, ROA, or Dividend Yield, a more conventional industry-mean imputation approach is applied, as described there as well.

In contrast, models such as XGBoost inherently manage missing values during tree construction and therefore do not require these imputations.

3.9 Pre-Modelling Procedures and Pipeline Design

Before applying imputation methods, the data is randomly split into training (80%) and testing (20%) sets to prevent data leakage. The next step of the data processing is to deal with the remaining missing values. When choosing an imputation strategy, one should consider why the values are missing and how extensive the missingness is.

Examining the data, it was discovered that missingness is most prevalent in CA/CL and P/E, with CA/CL having approximately a quarter of its observations missing, concentrated in banks, insurance, and real estate and investment companies (Industries 4 to 6). Since all CA/CL observations in Industries 4 and 6 are missing, imputation methods such as MissForest or mean imputation would not produce desirable outcomes for them, as these industries have no observed values to draw on. Because of this, in some models, a sentinel value of -999 was imputed where CA/CL was missing in Industries 4 and 6, together with a missingness flag to mark the imputed observations. This compromise was the result of testing different ways to deal with the issue, including omitting CA/CL as a variable.
As a result, the models using this preprocessing were able to recognize the sentinel as carrying no real weight for predicting the ESG score: the value -999 is far outside the range of ordinary CA/CL values, it is paired with the missingness flag, and it appears in only two industries. This allowed the variable to be kept in models that cannot internally handle missing values, as for observed CA/CL values the variable was a contributing predictor of the ESG score.

Another variable with high missingness was P/E. Many of the observations flagged as missing also had negative Earnings Before Interest and Taxes. Because most of the existing P/E values are positive while the missing values are negative by their real nature, typical imputation methods again do not provide desirable outcomes. For this reason, in models XGB3 and RF2, missing P/E values were replaced with -999 when EBIT was negative and flagged with a categorical indicator set to 1. Most missing P/E values resulted from the company having negative earnings, and these models require a value for optimal performance; the sentinel approach avoids sacrificing either the predictor itself or the corresponding rows. This method did not appear to bias the results, and it provided higher prediction accuracy for Random Forest models than for RF models that handled missing values internally: with the missing flag, the model was able to disregard imputed values far from the real ones, while also interpreting the positive link between the sustainability score and simply having a positive P/E value. This reflects the problem that when there are no negative P/E values, and the model is only comparing positive ones, the variable's predictive power over the sustainability score is low.
The main predictive power of this variable only emerged when observations with missing values, that is, negative earnings, could be included. For observations with positive EBIT, missing P/E values were imputed with industry means.

For the other variables, overall missingness was low, and they were imputed with the mean values of their respective industries, computed on the training set. The imputation results were then applied to the test set.

It should be noted that the motivation behind these imputation choices for specific models was to get the best results from models that have theoretical potential for competitive prediction accuracy but limited functionality to accommodate the weaknesses of the dataset in terms of missing values. While these imputations improved the performance of the Random Forest models, those models were ultimately outperformed by the XGBoost model that handled missingness internally and did not require the imputations discussed above.

3.10 Winsorisation

Winsorisation is the process of moving outlier values to match the values of a specific quantile (Ranta, 2023, Ch. 9). Analysing the dataset, a winsorisation threshold of 1%–99% is selected, similarly to, for example, Chowdhury et al. (2023), as it appears to be a literature standard for firm-level financial datasets of similar scale.

For Random Forest, winsorisation was performed on the training set after the imputation stage, to ensure that outlier caps were computed using the complete imputed distributions of each variable. P/E was capped only on the upper tail, at the 99th percentile among positive values, leaving the sentinel unchanged, while the other continuous variables were symmetrically capped at the 1st and 99th percentiles. The same percentile thresholds, estimated from the training set, were applied to the test set to ensure consistent feature ranges without leaking target information. Finally, industry dummies were created and aligned across splits.
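The P/E treatment described above can be sketched compactly. This is an illustrative reimplementation, not the thesis code: column names (`industry`, `EBIT`, `P/E`), the helper `impute_pe`, and the toy frames are all hypothetical, but the logic follows the stated rules, a sentinel of -999 plus a missing flag when EBIT is negative, and a training-set industry mean otherwise, applied identically to train and test.

```python
import numpy as np
import pandas as pd

SENTINEL = -999.0

def impute_pe(train, test):
    """Missing P/E with negative EBIT -> sentinel + flag;
    missing P/E with positive EBIT -> industry mean learned on train only."""
    means = train.groupby("industry")["P/E"].mean()  # no test-set leakage
    out = []
    for df in (train, test):
        df = df.copy()
        df["P/E_missing"] = (df["P/E"].isna() & (df["EBIT"] < 0)).astype(int)
        df.loc[df["P/E_missing"] == 1, "P/E"] = SENTINEL
        still = df["P/E"].isna()
        df.loc[still, "P/E"] = df.loc[still, "industry"].map(means)
        out.append(df)
    return out

train = pd.DataFrame({"industry": [1, 1, 2, 2],
                      "EBIT": [5.0, -3.0, 4.0, 6.0],
                      "P/E": [12.0, np.nan, 18.0, np.nan]})
test = pd.DataFrame({"industry": [1], "EBIT": [2.0], "P/E": [np.nan]})
train_i, test_i = impute_pe(train, test)
```

The same pattern (sentinel plus indicator column, restricted to selected groups) carries over to the CA/CL treatment for Industries 4 and 6.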
The adopted order maintains consistent data ranges and prevents the reappearance of outliers introduced by later steps.

For XGBoost, winsorisation was applied to all continuous financial ratios to mitigate the influence of extreme outliers while preserving the overall rank order and structure of the data. ASSETS were winsorised first and only then transformed into LOG(ASSETS). The 1st and 99th percentile thresholds were computed from the training set and then applied to both the training and test sets, ensuring consistent feature ranges and preventing data leakage. As XGBoost natively handles missing values, winsorisation was applied only to the non-missing values, leaving NaN entries untouched for internal treatment by the model.

4 Methodology

This chapter provides a short overview of the different machine learning techniques used in the thesis, followed by a review of the model specifications, such as data processing within each model and the model parameters.

4.1 Machine Learning Models – general information on models used

This subchapter provides a short overview of the different machine learning techniques used in the thesis.

4.1.1 Random Forest

Introduced by Leo Breiman in 2001, Random Forest is an ensemble learning method that constructs a multitude of decision trees and combines their results for predictions (Ramchandran et al., 2021). The random forest algorithm selects a random subset of features for each weak estimator, a technique known as the random subspace method (Ranta, 2023, Ch. 8). The idea of random forest is to reduce overfitting and improve accuracy by averaging many uncorrelated decision trees; while a single decision tree can be prone to bias, a random forest reduces this by training multiple trees on random subsets of the data and then aggregating their outputs (IBM, n.d.-a).
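The train-then-apply winsorisation step can be sketched as follows. This is a minimal illustration with synthetic data: caps are estimated with `np.nanpercentile` on the training column only and then applied unchanged to the test column, and NaN entries pass through untouched, matching the XGBoost treatment described above.

```python
import numpy as np

def winsorise(train_col, test_col, lower=0.01, upper=0.99):
    """Cap values at percentiles estimated on the training set only,
    then apply the same caps to the test set (no leakage).
    NaNs propagate through np.clip and are left untouched."""
    lo, hi = np.nanpercentile(train_col, [lower * 100, upper * 100])
    return np.clip(train_col, lo, hi), np.clip(test_col, lo, hi)

rng = np.random.default_rng(2)
train = rng.normal(0, 1, 1000)
train[0] = 50.0                        # an extreme outlier to be capped
test = np.array([-40.0, 0.2, np.nan])  # test outlier capped at train's 1st pct
train_w, test_w = winsorise(train, test)
```

For the one-sided P/E rule, the same function would be called with `lower=0.0` on the positive values only, leaving the -999 sentinel outside the capping step.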
This so-called bagging (bootstrap aggregating) approach, combined with feature randomness, means that each tree produces a distinct result, and their average is more generalizable than any tree individually (IBM, n.d.-a). In practice, each tree is grown on a bootstrap sample of the training dataset, and at each split only a random subset of features is considered as candidates for splitting (Breiman, 2001). This process introduces diversity among the trees, preventing any single feature or data point from dominating the model. After training, the forest makes predictions by aggregating the trees' outputs. Random forest is capable of both classification and regression tasks: for a regression problem, the model averages the predicted values from all trees, and for classification, it takes a majority vote among the trees (IBM, n.d.-a).

4.1.2 XGBoost

Like random forests, gradient boosting is an ensemble method that combines multiple decision trees, but unlike random forest, boosting builds trees sequentially rather than in parallel (NVIDIA, n.d.). In gradient boosting, each new tree is trained to correct the errors (residuals) of the combined previous trees, gradually improving the model's performance (NVIDIA, n.d.). The model can be seen as a weighted sum of multiple weak learner trees, which together form a model with strong predictive power. The final model comes together by combining many small decision trees, each one correcting the mistakes of the previous ones. In contrast to random forests, which reduce variance by averaging many independent trees, boosting reduces bias by gradually improving the fit through a sequence of corrections (NVIDIA, n.d.).

XGBoost (Extreme Gradient Boosting) is one of the most successful implementations of gradient boosting techniques (Ranta, 2023, Ch. 8). It extends the basic gradient boosting framework with engineering improvements and regularization, enhancing model speed and accuracy (NVIDIA, n.d.).
XGBoost builds decision trees using a level-wise parallelization strategy: while boosting itself remains sequential tree by tree, XGBoost evaluates multiple candidate splits in parallel when growing each tree (NVIDIA, n.d.). It also includes a regularized learning objective, which adds a penalty for model complexity to the loss function, helping prevent overfitting and improving generalization (NVIDIA, n.d.). In practice, XGBoost's objective at each iteration minimizes a combination of the gradient-based loss (e.g. mean squared error) and a regularization term that penalizes overly complex trees (NVIDIA, n.d.).

4.1.3 RIDGE and LASSO Regression

Ridge and Lasso regression extend the scope of linear regression models by including regularization (Ranta, 2023, Ch. 9). The methods constrain the coefficient estimates and shrink them towards zero. The models address the common regression problems of multicollinearity and overfitting, improving robustness compared to traditional regression. Ridge and Lasso are considered among the best regularization techniques (Ranta, 2023, Ch. 9).

In a standard linear regression, the model finds coefficients that minimize the residual sum of squares, fitting the data as closely as possible. Both ridge and lasso add a penalty on the size of the regression coefficients to the loss function, which prevents the model from relying heavily on any specific variable, making the results more stable (IBM, n.d.-b).

Ridge regression (L2 regularization) shrinks coefficients towards, but never fully to, zero, retaining all predictors in the model while reducing their influence (IBM, n.d.-b). Ridge is useful when predictors are correlated, as it spreads their effect more evenly (IBM, n.d.-b).

Lasso regression (L1 regularization) can also shrink coefficients all the way to zero. In practice, Lasso can automatically select a smaller subset of predictors, which can improve model interpretability and efficiency (IBM, n.d.-b).
Lasso is valuable when identifying and selecting the most important predictors is a priority. Both models balance model complexity and prediction accuracy in different ways (IBM, n.d.-b).

4.2 Trained Models

This subchapter further reviews the model specifications, such as data processing within the model and model parameters for the trained models. The models were implemented and executed in Python using the Google Colab environment.

4.2.1 XGB1

An XGBoost model trained on the original data without imputation of missing variables, as the model can internally deal with missing values. This was the best performing model with both ASSETS and SIZE (LOG(ASSETS)), and it performed best after retraining the best parameters for the SIZE-based model around the prior best searches, with parameter distributions of:

"max_depth": randint(6, 18),
"min_child_weight": randint(1, 10),
"learning_rate": loguniform(0.01, 0.3),
"n_estimators": randint(150, 600),
"subsample": uniform(0.6, 0.4),
"colsample_bytree": uniform(0.6, 0.4),
"gamma": uniform(0.0, 2.0),
"reg_alpha": loguniform(1e-4, 10.0),
"reg_lambda": loguniform(1e-3, 30.0),

yielding best parameters of: {learning_rate ≈ 0.024, max_depth = 16, n_estimators = 356, subsample ≈ 0.643, colsample_bytree ≈ 0.889, gamma ≈ 0.472, min_child_weight = 9, reg_alpha ≈ 0.014, reg_lambda ≈ 0.0013}. The same parameter distributions apply to the following XGBoost models as well.

Hyperparameter optimization for the XGBoost model was conducted using scikit-learn's RandomizedSearchCV, which samples random combinations of parameter values from pre-specified ranges and evaluates each using five-fold cross-validation (cv=5). Sixty parameter combinations were evaluated (n_iter=60), and the configuration yielding the lowest cross-validated RMSE was selected. The model was then refitted on the full training set using these optimal parameters (refit=True) to obtain the final tuned estimator.
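The RandomizedSearchCV procedure described above can be sketched as follows. To keep the example self-contained and quick to run, it uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, synthetic data, and a reduced search budget (n_iter=5, cv=3 instead of the 60 and 5 used in the thesis); only the distributions that overlap with the listing above are included, and the data and target are illustrative.

```python
import numpy as np
from scipy.stats import loguniform, randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy regression data standing in for the firm-year feature matrix
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 200)

param_distributions = {
    "max_depth": randint(2, 8),
    "learning_rate": loguniform(0.01, 0.3),
    "n_estimators": randint(50, 200),
    "subsample": uniform(0.6, 0.4),   # scipy's uniform(loc, scale) = U[0.6, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions,
    n_iter=5,        # the thesis uses n_iter=60
    cv=3,            # the thesis uses cv=5
    scoring="neg_root_mean_squared_error",
    random_state=0,
    refit=True,      # refit the best configuration on the full training set
)
search.fit(X, y)
```

Note that `uniform(0.6, 0.4)` in scipy parameterizes the interval by location and width, so the listed subsample and colsample distributions cover [0.6, 1.0], consistent with the best values reported above.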
This process of hyperparameter optimization applies to the following XGBoost and Random Forest models as well. Table 8 below describes what the XGBoost hyperparameters control in the model and their effects on model learning, based on the official XGBoost documentation (XGBoost Developers, 2024).

Table 8. Descriptions of XGBoost hyperparameters (XGBoost Developers, 2024).

learning_rate: Shrinkage factor applied to each tree's contribution. A smaller value slows learning and requires more trees, reducing overfitting.
max_depth: Maximum depth of individual trees. Deeper trees capture more complex nonlinear relationships but increase overfitting risk.
n_estimators: Number of boosting rounds. Determines total model complexity together with the learning rate: more trees compensate for a lower learning rate.
subsample: Fraction of the training data randomly sampled for each tree. Introduces randomness, improving generalization and reducing overfitting.
colsample_bytree: Fraction of features randomly selected for each tree. Reduces feature correlation effects and increases model diversity.
gamma: Minimum required loss reduction to make a further split. Acts as a regularization term: higher values make the algorithm more conservative, pruning weak splits.
min_child_weight: Minimum sum of instance weights in a child node. Prevents the creation of nodes representing too few samples or low variance, stabilizing deep trees and controlling overfitting.
reg_alpha: L1 regularization term on leaf weights. Encourages sparsity in tree leaf weights, reducing the number of active leaves and simplifying the model.
reg_lambda: L2 regularization term on leaf weights. Penalizes large weight magnitudes to reduce model variance and improve generalization stability.

4.2.2 XGB2

Same data and model as XGB1, but uses ASSETS instead of SIZE, and accordingly has different best parameters.
Best parameters: {'colsample_bytree': 0.9, 'learning_rate': 0.1, 'max_depth': 15, 'min_child_weight': 5, 'n_estimators': 290, 'subsample': 0.8}

4.2.3 XGB3

Trained on data with the same preprocessing steps as RF2. Best parameters: {'colsample_bytree': 0.9, 'learning_rate': 0.1, 'max_depth': 15, 'min_child_weight': 5, 'n_estimators': 290, 'subsample': 0.8}

4.2.4 RF1

This Random Forest model is trained on data where CA/CL is dropped, as approximately 25% of its observations are missing. For variables with low missingness, industry-mean imputation is used. For P/E, mean imputation is applied where EBIT is positive and -999 where EBIT is negative, with a P/E missing flag added for missing observations. Hyperparameters were tuned with RandomizedSearchCV, yielding best parameters: {'max_depth': 14, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 140}

4.2.5 RF2

CA/CL is imputed with -999 for Industries 4 to 6 when missing; for missing values in Industries 1 to 3, industry-mean imputation is applied, and a CA/CL missing flag is added for missing observations. For P/E, mean imputation is applied where EBIT is positive and -999 where EBIT is negative, with a P/E missing flag added for missing observations. Hyperparameters were searched using RandomizedSearchCV, with parameter distributions of:

"n_estimators": [300, 500, 800],
"max_depth": [None, 8, 12, 16, 20],
"min_samples_split": [2, 4, 6, 10],
"min_samples_leaf": [1, 2, 4, 8],
"max_features": ["sqrt", "log2", 0.3, 0.5, 0.7, 1.0],
"bootstrap": [True],
"max_samples": [None, 0.6, 0.8],
"criterion": ["squared_error", "absolute_error"],
"min_impurity_decrease": [0.0, 1e-6, 1e-5, 1e-4],
"max_leaf_nodes": [None, 128, 256],
"ccp_alpha": [0.0, 1e-4, 1e-3]

resulting in best parameters of: {'n_estimators': 800, 'min_samples_split': 6, 'min_samples_leaf': 1, 'min_impurity_decrease': 1e-05, 'max_samples': None, 'max_leaf_nodes': None, 'max_features': 1.0, 'max_depth': 16, 'criterion': 'squared_error', 'ccp_alpha': 0.0001, 'bootstrap': True}.
Table 9 below describes what the Random Forest hyperparameters control in the model and their effects on model learning, based on the RandomForestRegressor implementation in scikit-learn (scikit-learn developers, 2024).

Table 9. Descriptions of RandomForest parameters (scikit-learn developers, 2024).

n_estimators: The number of trees in the ensemble. Larger numbers reduce variance through averaging, up to a point.
max_depth: Maximum depth of individual trees. Controls the detail of each tree's partitions. Shallow trees simplify relationships and reduce overfitting; deeper trees allow more complex patterns but increase variance.
min_samples_split: Minimum number of samples required to split an internal node. Higher values: fewer splits and smoother predictions. Lower values: deeper, more detailed trees.
min_samples_leaf: Minimum number of samples required to form a leaf node. Prevents leaves that represent very small sample subsets. Increasing it smooths predictions.
max_features: Proportion of predictors randomly considered at each split. Smaller values increase model diversity and reduce overfitting; larger values allow each tree to fit the training data more accurately.
bootstrap: Whether to sample training observations with replacement when building each tree. The default True enables classical bagging, producing more diverse trees and allowing estimation of out-of-bag error for validation.
max_samples: If set (< 1.0), limits the proportion of the training data used for each bootstrap sample. Using less than the full sample increases tree diversity and training speed but can increase bias.
criterion: Measure of split quality. "squared_error" minimizes mean squared deviations (the default for regression); "absolute_error" is more robust to outliers and yields median-based predictions.
min_impurity_decrease: A split is performed only if it decreases impurity (MSE or MAE) by at least this value.
Acts as a small regularization term, preventing low-gain splits and reducing overfitting.
max_leaf_nodes: Caps the number of terminal nodes in each tree. Provides a direct upper bound on tree complexity, similar to limiting max_depth. Smaller values simplify the model and control overfitting.
ccp_alpha: Cost-complexity pruning parameter. After trees are built, branches that contribute less than ccp_alpha to overall model performance are pruned. Larger values yield simpler, more regularized models.

4.2.6 RF3

Instead of the scikit-learn library's RandomForestRegressor, this model uses the xgboost library's XGBRFRegressor. It applies random forest-style bagging instead of boosting residuals and can natively handle missing values, with a parameter grid of:

"n_estimators": [100, 200, 300, 500],
"max_depth": [3, 5, 7, 9, 12],
"subsample": [0.6, 0.8, 1.0],
"colsample_bynode": [0.6, 0.8, 1.0],
"reg_alpha": [0.1, 0.5, 1],
"reg_lambda": [1, 2, 3]

resulting in best parameters of: {'n_estimators': 200, 'max_depth': 12, 'subsample': 1.0, 'colsample_bynode': 0.6, 'reg_alpha': 1, 'reg_lambda': 1}

4.2.7 RIDGE

L2 regularization. Ridge minimizes the squared error with a penalty on the squared magnitude of the coefficients. It helps prevent overfitting by shrinking coefficients, which is especially useful when predictors are correlated. Ridge regression natively handles multi-target outputs. Variables with excessive missingness (CA/CL and P/E) were dropped; for the other variables with low missingness, industry-mean imputation was applied. Because asset values typically span several orders of magnitude and can be right-skewed, a logarithmic transformation was applied. The model was trained with sklearn.linear_model.Ridge. Features were standardized using StandardScaler, and the model used a fixed penalty term alpha=1.0.

4.2.8 LASSO

L1 regularization for multi-output. LASSO encourages sparsity by penalizing the absolute values of the coefficients, which can set some coefficients exactly to zero. The MultiTaskLasso variant enforces a common sparsity structure ac