Antero Vuorela

Enhancing Large Language Models Through Post-Training, External Tool Integration and External Information Integration

Vaasa 2025
School of Technology and Innovations
Bachelor's thesis
Automation and information technology

UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Antero Vuorela
Title of the thesis: Enhancing Large Language Models Through Post-Training, External Tool Integration and External Information Integration
Degree: Bachelor of Technology
Discipline: Automation and Information Technology
Supervisor: Janne Koljonen
Year: 2025
Pages: 30

ABSTRACT:

Large language models (LLMs) have become a central part of modern artificial intelligence, enabling applications ranging from chatbots to content generation and research assistants. While large-scale pre-training on text corpora has produced impressive general language capabilities, it has also left significant inherent limitations, such as a tendency to hallucinate, outdated knowledge and limited reasoning ability. This bachelor's thesis examines ways to improve the performance of large language models through post-training and through the integration of external tools and information into model operation.

The thesis investigates several post-training methods, including supervised fine-tuning, reinforcement learning from human or AI feedback (RLHF, RLAIF), direct preference optimization (DPO) and test-time scaling, which increases reasoning performance at inference. It also covers the use of external information sources, such as retrieval-augmented generation, and the integration of tools and application programming interfaces, including the Model Context Protocol. The analysis shows that these methods complement each other and should be combined systematically to achieve the best results.

The results indicate that no single method is sufficient to solve all the challenges of LLMs. Instead, performance and reliability improve markedly when different techniques are deliberately combined into a coherent whole. This thesis provides an overview of post-training methods for large language models and recommendations for building smarter, more context-aware applications.

KEYWORDS: large language models, post-training, fine-tuning, reinforcement learning, retrieval-augmented generation, tool integration, AI alignment, model context protocol

Contents

1 Introduction
2 Large Language Models
2.1 Overview of Large Language Models
2.2 Foundational technology
2.2.1 Machine learning and deep learning
2.2.2 Natural language processing and text representation
2.2.3 Language modelling and transformer architecture
2.2.4 Foundation models
2.3 The function of large language models
2.4 The importance of post-training improvement of large language models
3 External tool usage and post-training enhancement of LLMs
3.1 Post-training techniques
3.2 External tool and information integration
4 Discussion
4.1 Post-training techniques analysis
4.2 External tool and information integration
4.3 Integration strategies
4.4 Practical considerations and trade-offs
5 Conclusion
References

Figures

Figure 1. The transformer model architecture.
Figure 2. The supervised fine-tuning process.
Figure 3. Comparison of RLHF and RLAIF for preference alignment.
Figure 4. RLHF and DPO compared.
Figure 5. The process of OAIF.
Figure 6. Best-of-N sampling and sequential revisions compared.
Figure 7. The retrieval-augmented generation process.
Figure 8. Tool use with and without MCP compared.

Tables

Table 1. Enhancement techniques divided into layers based on their function.

Abbreviations

AI      Artificial Intelligence
API     Application Programming Interface
CoT     Chain-of-Thought
DAP     Direct Alignment from Preferences
DL      Deep Learning
DPO     Direct Preference Optimization
LLM     Large Language Model
MCP     Model Context Protocol
ML      Machine Learning
NLP     Natural Language Processing
OAIF    Online AI Feedback
RAG     Retrieval-Augmented Generation
RL      Reinforcement Learning
RLHF    Reinforcement Learning with Human Feedback
RLAIF   Reinforcement Learning with AI Feedback
SFT     Supervised Fine-Tuning
TTC     Test-Time Computation
UI      User Interface

1 Introduction

Large language models (LLMs) have transformed the landscape of natural language processing, yet they often show limitations in areas such as reasoning capabilities, ethical understanding and domain-specific performance. These issues largely stem from their pre-trained architectures (Tie et al., 2025). While pre-training on vast corpora has laid the foundations for these models, research is shifting toward post-training techniques to achieve further improvements (Kumar et al., 2025).

Post-training is a way of specializing a language model after the foundations have been laid in pre-training. Current techniques include fine-tuning, which uses curated data to update the parameters and specialize the model; reinforcement learning, which uses feedback to alter the model's responses; and test-time scaling, which allows the model to allocate additional computational resources during inference for more reasoning (Kumar et al., 2025). To further improve the accuracy of LLMs, techniques such as retrieval-augmented generation, which allows models to access external data (Gao et al., 2023), and external tool integration, which improves accuracy in answering questions (Zhuang et al., 2023), should also be considered.

The central question this thesis examines is: how can we effectively build on top of large language models through post-training techniques as well as external tool and information integration? By investigating these interconnected strategies, this bachelor's thesis aims to provide insight into alternative approaches for significantly improving the performance of LLMs after pre-training.

The purpose of this bachelor's thesis is to examine the different ways in which large language models can be improved post-training using different tools and techniques. The study will analyse how different post-training techniques can improve model performance, assess the benefits of contextual data integration, and explore the role of external tool integration in expanding the functionality of LLMs. The goal is to build a comprehensive understanding of post-training enhancement techniques and look at how they complement each other.

Chapter 2 gives a general overview of LLMs along with an examination of why post-training enhancement techniques are important. Chapter 3 gives a more in-depth look into the function and potential of different post-training techniques. Chapter 4 focuses on analysis of the techniques discussed in the previous chapter, with the aim of finding the best ways of integrating them. Chapter 5 concludes the thesis and summarizes the findings.
2 Large Language Models

Large language models (LLMs) are a relatively new AI technology, and their fast development has outpaced education about them. This section explains the foundational AI technology that makes modern LLMs possible, what LLMs are and how they work, and why post-training improvements of LLMs are important.

2.1 Overview of Large Language Models

Large language models are a significant breakthrough in AI and natural language processing. They are a type of foundation model that can generate and understand language and other content to accomplish diverse tasks. An LLM is a machine learning model with a very large number of parameters, often billions or trillions, which is trained on massive text datasets. LLMs can infer from context, generate logical human-like responses, translate between languages, summarize, answer questions and assist with coding, and they are already revolutionizing fields such as chatbots, virtual assistants, content generation, research assistance and language translation (IBM, 2023).

2.2 Foundational technology

Large language models are a recent development in the field of generative AI, and they are only possible due to big innovations in subfields of AI such as deep learning, natural language processing and transformer models. This section gives context on the technologies LLMs rely on.

2.2.1 Machine learning and deep learning

Most current AI systems use machine learning. In predictive machine learning, models trained on historical data are used to make predictions about the future. The rise of machine learning changed the way AI systems were built: rather than giving the program instructions on how to solve a task, a machine learning system learns it from the data (Bommasani et al., 2021). Deep learning is a subset of machine learning that simulates the decision-making power of the human brain by using neural networks with multiple layers (Holdsworth & Scapicchio, 2024). Deep learning differs from conventional machine learning by its ability to effectively process natural, unannotated data (LeCun et al., 2015).

2.2.2 Natural language processing and text representation

Natural language processing (NLP) is a subfield of computer science and artificial intelligence that uses machine learning to enable computers to understand and communicate using human language. Natural language processing merges rule-based computational linguistics with statistical modelling, machine-learning algorithms and deep learning, enabling computers and digital systems to detect, interpret and generate text and speech (Stryker & Holdsworth, 2024).

In pre-training, LLMs tokenize text, that is, convert it into sequences of tokens. The tokens can be words, sub-words or characters. A token is the smallest unit of meaning in text that a language model can understand, and each token is assigned its own unique integer after tokenization (Gondal, 2024).
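As a concrete illustration of this tokenization step, the short sketch below uses the open-source tiktoken library, chosen here only as an example tokenizer; the token boundaries and integer IDs noted in the comments are illustrative rather than authoritative:

```python
# Minimal tokenization example using the open-source tiktoken library.
# Token boundaries and integer IDs are specific to the chosen vocabulary;
# this only illustrates the text -> token IDs -> text round trip.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a byte-pair-encoding vocabulary

text = "Large language models tokenize text."
token_ids = enc.encode(text)                    # text -> integer token IDs
tokens = [enc.decode([i]) for i in token_ids]   # surface form of each token

print(token_ids)   # a list of integers, one per token
print(tokens)      # e.g. ['Large', ' language', ' models', ' token', 'ize', ' text', '.']
assert enc.decode(token_ids) == text            # decoding recovers the text
```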
For language models to incorporate contexts of ever larger scope, embeddings were implemented to effectively use unlabelled data in a self-supervised manner. In language modelling, embeddings are divided into static and dynamic embeddings. Static embeddings are pre-trained into the model and capture the meaning of a word based on its use in the training dataset. Later, deep neural network models were able to derive dynamic embeddings, which allow the vector representation of a token to change based on the context of the surrounding words (Patil & Gudivada, 2024).

2.2.3 Language modelling and transformer architecture

Language modelling is a key task in NLP and language understanding. It works by learning a probability distribution over sequences of characters relating to a language. A language model (LM) can capture grammatical structure and distil information from corpora (Jozefowicz et al., 2016). In practice, an LM learns to predict the next word given the previous context.

Transformer models are a form of neural network architecture that is well suited to processing sequential data. Transformer models are most prominently used for natural language processing tasks in LLMs, but they also perform well in other AI fields such as computer vision and speech recognition.

Figure 1. The transformer model architecture (Vaswani et al., 2017).

What makes transformer models powerful is their inherent self-attention mechanism, as seen in Figure 1. The attention layer gives transformer models a much greater ability to discern the relationships and dependencies between different parts and words of the input sentence compared to preceding architectures such as recurrent neural networks and convolutional neural networks (Stryker & Bergmann, 2025).

2.2.4 Foundation models

Foundation models are models trained on a wide range of unlabelled data that can be used for different tasks, in contrast to earlier task-specific AI models (Murphy, 2022). Foundation models are a central but still incomplete part of AI, most prominently applied in natural language processing. Foundation models are enabled by scale and transfer learning. Transfer learning is what makes foundation models possible: its purpose is to apply the comprehension gained from one task to another. Scale is what makes foundation models powerful (Bommasani et al., 2021). Massive training datasets are what make modern LLMs possible.

2.3 The function of large language models

LLMs work by applying deep learning to massive amounts of textual data. They are generally based on a transformer architecture, which is well suited to understanding sequential text data. LLMs are multilayered neural networks, and each layer has parameters that can be fine-tuned in training. There is also an attention mechanism, which focuses on specific parts of the input data (IBM, 2023).

The development of LLMs is divided into two main parts, pre- and post-training. The purpose of pre-training is to use massive amounts of data to train a foundation model. A foundation model then allows easy fine-tuning for different use cases in the post-training phase.

In the training phase, LLMs learn to predict the next word in a sentence using the context given by the earlier words. The preceding tokens are transformed into embeddings that together numerically represent the context of the sentence, and from this representation the model assigns probability scores to candidate next words (IBM, 2023).

For LLMs to be accurate, they need to train on massive amounts of data, allowing them to learn grammar, semantics and conceptual relationships. After training, LLMs can generate text by predicting the next word based on the user input, utilizing the patterns and information they have gained (IBM, 2023).
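As a toy illustration of this next-word prediction, the sketch below converts a vector of raw model scores (logits) over an invented four-word vocabulary into a probability distribution with a softmax and picks the most likely continuation; the words and numbers are made up for illustration:

```python
import numpy as np

# Hypothetical logits a model might assign to candidate next tokens
# after the prompt "The cat sat on the". The values are invented.
vocab = ["mat", "dog", "roof", "banana"]
logits = np.array([3.2, 0.1, 1.7, -2.0])

# Softmax turns raw scores into a probability distribution summing to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"{token:>7}: {p:.3f}")

# Greedy decoding picks the most likely token; real systems often sample
# from the distribution instead to produce more varied text.
print("next token:", vocab[int(np.argmax(probs))])  # -> mat
```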
2.4 The importance of post-training improvement of large language models

LLMs have revolutionized NLP, allowing for diverse applications across various domains. However, they remain constrained by inherent limitations such as limited context lengths, tendencies to hallucinate, meaning to make up information, suboptimal reasoning proficiency and inherited biases (Tie et al., 2025). Performance improvements of language models have mainly relied on the scaling of train-time compute using self-supervised pre-training (Kaplan et al., 2020; Hoffmann et al., 2022). Pre-training with massive amounts of data has set the foundation for LLMs, but research is increasingly shifting towards post-training techniques for further improvements (Kumar et al., 2025).

Metz et al. (2024) report that some of the largest companies in the LLM space, such as OpenAI, Google and Anthropic, have been struggling to meaningfully improve on their previous models. They claim that part of the reason for this is the difficulty of finding untapped sources of high-quality data. This emerging plateau in model performance has shifted the focus to the exploration of alternative enhancement strategies beyond raw scaling. According to Metz et al. (2024), rather than relying solely on ever-larger datasets, researchers are increasingly investigating methods that enable models to improve post-training. One thing they highlight is OpenAI's o1 model, which spends more time computing the answer before giving an output, a process which OpenAI calls reasoning.

3 External tool usage and post-training enhancement of LLMs

The reasoning, ethical and domain-specific performance of pre-trained LLM architectures remains limited, necessitating a move to advanced post-training of language models (Tie et al., 2025). To further improve the accuracy of LLMs, techniques such as retrieval-augmented generation (Gao et al., 2023) and external tool integration (Zhuang et al., 2023) should be considered. While this chapter focuses on the individual enhancement techniques, chapter 4 gives a more comprehensive overview of their different roles and integration strategies.

3.1 Post-training techniques

While pre-training provides a broad lingual foundation, post-training is an important step to refine the model's knowledge, reasoning capabilities, factual accuracy, alignment with user intent and ethical considerations (Kumar et al., 2025). In short, the purpose of post-training is to specialize the model in a chosen direction.

Fine-tuning is a post-training technique with the purpose of specializing a model using curated data. Many different fine-tuning techniques exist for different purposes. Supervised fine-tuning (SFT) is used to refine a pre-trained language model on a supervised dataset containing high-quality human-made examples. The purpose of SFT is to ensure the model complies with style and format guidelines (Kumar et al., 2025). Put simply, SFT teaches the model to mimic ideal responses.

Figure 2. The supervised fine-tuning process (Tie et al., 2025).

As depicted in Figure 2, the SFT process begins with a pre-trained model. During fine-tuning, the model is aligned with the requirements of the given application by adjusting the parameters using task-specific annotated data (Tie et al., 2025). Fine-tuning forms the foundation for post-training, and it comes in many forms. Instruction fine-tuning, for example, guides models to follow user instructions accurately and helpfully. In contrast, dialogue fine-tuning uses chat transcripts to improve multi-turn conversational ability. Chain-of-thought (CoT) reasoning fine-tuning is used to teach models to produce reasoning traces by training on supervised reasoning annotations, which can improve both the explainability and accuracy of the model on complex tasks (Kumar et al., 2025).
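At its core, SFT is ordinary supervised next-token training on curated examples. The sketch below shows one hypothetical update step for a generic causal language model; the model and optimizer objects are assumed stand-ins, and real pipelines add batching and usually mask the loss on prompt tokens so that only the target response is imitated:

```python
import torch.nn.functional as F

# One supervised fine-tuning step, sketched for a generic causal LM that
# maps token IDs to next-token logits (model and optimizer are assumed).
def sft_step(model, optimizer, input_ids):
    logits = model(input_ids)          # (batch, seq_len, vocab_size)

    # Next-token objective: position t predicts the token at position t+1,
    # so the targets are the inputs shifted left by one.
    targets = input_ids[:, 1:]
    logits = logits[:, :-1, :]

    # Cross-entropy pulls the model toward the curated "ideal" responses.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```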
Reinforcement learning (RL) allows an agent to learn behaviour through trial-and-error interactions with a dynamic environment (Kaelbling, Littman & Moore, 1996). A common purpose of using reinforcement learning for LLMs is to align the model's behaviour in a desired direction based on feedback, but it can also be used to increase the reasoning capabilities of a model. For aligning the model, reinforcement learning with human feedback (RLHF) is an essential fine-tuning method. This approach works by using a reward model that explicitly takes human input, allowing the model to adapt more closely to human preferences (Tie et al., 2025). Unlike SFT, RLHF trains a model to generalize past specific examples and gain a nuanced understanding of human preferences. For a more scalable and low-cost solution, reinforcement learning with AI feedback (RLAIF) is an alternative (Tie et al., 2025).

Figure 3. Comparison of RLHF and RLAIF for preference alignment (Tie et al., 2025).

Direct alignment from preferences (DAP) methods are a recent development offering alternative approaches to preference alignment (Guo et al., 2024). Direct preference optimization (DPO) was first presented by Rafailov et al. (2023) as a more computationally efficient and stable alternative to RLHF. DPO makes the reward optimization process simpler by linking the reward function to the optimal policy directly. Instead of training a reward model and fine-tuning a language model through reinforcement learning, DPO trains the LM directly with human preference data, thus removing the need to fit a reward model (Tie et al., 2025).

Figure 4. RLHF and DPO compared (Rafailov et al., 2023).

As seen in Figure 4, DPO simplifies the process by doing away with explicit reinforcement learning and reward modelling. For all the benefits of DAP methods, they tend to suffer from a lack of online feedback. The preference datasets for DAP methods are generally gathered before the training phase, and the responses in the datasets are generated by different LLMs. This leaves the feedback of DAP methods completely offline, as the LLM policy in training is unable to get feedback on its own outputs. The end result is a distribution shift between the policy being aligned and the one that generated the dataset. This is in contrast to RLHF, where the reward model gives the LLM policy online feedback on its outputs during reinforcement learning (Guo et al., 2024).

As the next step from RLHF and DAP methods, Guo et al. (2024) propose online AI feedback (OAIF). OAIF promises to combine the practicality of DAP methods with the online nature of RLHF. OAIF, specifically the alignment of the LLM policy, works by sampling two responses to a prompt from the current policy, obtaining online feedback by prompting an LLM to give preference annotations, and using the feedback to update the model policy through standard DAP losses (Guo et al., 2024). This approach is shown in Figure 5.

Figure 5. The process of OAIF (Guo et al., 2024).
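To make the preference-optimization objectives above concrete, the sketch below implements the DPO loss of Rafailov et al. (2023); it is a minimal sketch, and the helper that computes the summed log-probability of a response under a given model is assumed rather than shown:

```python
import torch.nn.functional as F

# DPO loss sketch (Rafailov et al., 2023). Inputs are the summed
# log-probabilities of the preferred ("chosen") and dispreferred
# ("rejected") responses under the policy being trained and under a
# frozen reference model.
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are log-probability ratios against the reference
    # model, so no separate reward model has to be fitted.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # A logistic loss pushes the chosen response to outscore the rejected
    # one, optimizing the policy directly on preference pairs.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```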
Reinforcement learning for reasoning entails the use of reward-based optimization to improve the model's chain-of-thought capabilities (Tie et al., 2025). In much the same way that humans spend more time thinking about complex problems, reinforcing a model to think tasks through step by step can help it with long-term reasoning tasks (Tie et al., 2025).

Test-time computation (TTC), or inference, is the process during which the model takes input data and uses its learned parameters to generate an output. Test-time scaling, as proposed by Snell et al. (2024), allows a model to allocate additional computational resources during the inference phase of the model's operation. This approach allows the model to process challenging prompts more thoroughly, similarly to how humans spend more time thinking about complex problems to improve their answers. Best-of-N sampling is perhaps the simplest approach to scaling TTC (Snell et al., 2024). As seen in Figure 6, the premise of best-of-N is generating multiple outputs in parallel and selecting the one that scores highest with a learned verifier (Cobbe et al., 2021) or a reward model (Lightman et al., 2023). As Figure 6 also shows, another option is to use sequential revisions, each generated in sequence and conditioned on previous attempts, or a combination of the two methods (Snell et al., 2024).

Figure 6. Best-of-N sampling and sequential revisions compared (Snell et al., 2024).

The viability of scaling test-time compute was recently demonstrated by OpenAI with their o1 model, the approach of which they describe as using large-scale RL (OpenAI, 2024; Muennighoff et al., 2025). Soon after o1's release, DeepSeek R1 was able to replicate o1-level performance, similarly using RL with multiple training stages and millions of samples (DeepSeek-AI et al., 2025; Muennighoff et al., 2025). However, OpenAI's and DeepSeek's advancements in test-time scaling lack transparency, which limits research progress. In an effort to create a simple and open test-time scaling solution, Muennighoff et al. (2025) propose forcing a maximum and/or minimum number of thinking tokens as a decoding-time intervention mechanism called budget forcing. Budget forcing can extend inference by suppressing the end-of-thinking token delimiter and optionally injecting a token such as "Wait" into the reasoning trace to help the model reflect on what it has generated. While budget forcing did improve model performance, the gains were found to flatten out when scaling further, and the technique is limited by the model's context window (Muennighoff et al., 2025). Despite these limitations, test-time scaling has already been proven viable, and further research will bring it to open models.
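A minimal sketch of the best-of-N idea is given below; generate and score are hypothetical placeholders standing in for a sampling call to a model and for a learned verifier or reward model, respectively:

```python
from typing import Callable

# Best-of-N sampling sketch. `generate` draws one candidate answer from a
# model (with sampling noise, so calls differ) and `score` stands in for a
# learned verifier or reward model; both are hypothetical placeholders.
def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    # Draw N candidates independently; a real system would batch these
    # calls and run them in parallel.
    candidates = [generate(prompt) for _ in range(n)]

    # Keep the candidate the verifier scores highest. Sequential revisions
    # would instead condition each new attempt on the previous ones.
    return max(candidates, key=lambda answer: score(prompt, answer))
```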
3.2 External tool and information integration

As LLMs have a propensity to hallucinate and their numerical reasoning capabilities remain limited, external tools and data can be used to enhance their ability to answer questions (Zhuang et al., 2023).

Without an ability to retrieve external data, basic pre-trained LLMs tend to suffer from a lack of context when faced with requests dealing with information that is not included in the training data, is highly specific or needs to be up to date. For this reason, techniques like retrieval-augmented generation (RAG) have been created to allow LLMs to access external information. Retrieval-augmented generation allows language models to integrate external knowledge sources into the generation process. This addresses inherent limitations of LLMs, such as hallucinations, meaning the making up of information, outdated information and, since the source of the information provided by an LLM is usually unclear, a lack of transparency in answers. As depicted in Figure 7, RAG works by retrieving relevant information from external sources before generating a response. This allows LLMs to access up-to-date and domain-specific information that might not be present in the training data. By combining the general knowledge the model learned in the training phase with external up-to-date and domain-specific information, RAG enhances the reliability and accuracy of the generated text (Gao et al., 2023). While RAG addresses the problems of knowledge cutoff and suboptimal model accuracy, it remains limited to passive retrieval of information, meaning a RAG-based system cannot take further action than providing textual responses (Hou et al., 2025).

Figure 7. The retrieval-augmented generation process (Prompt Engineering Guide, n.d.).
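A minimal sketch of this retrieve-then-generate pattern is given below; the embed and llm calls are hypothetical placeholders for an embedding model and a language model, and production systems add document chunking, vector indexing, reranking and prompt templating:

```python
import numpy as np

# Retrieval-augmented generation sketch. `embed` maps text to a vector and
# `llm` generates text from a prompt; both are hypothetical placeholders.
def rag_answer(question, documents, embed, llm, k=3):
    # 1. Retrieve: rank documents by cosine similarity to the question.
    q = embed(question)
    def similarity(doc):
        d = embed(doc)
        return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
    top_docs = sorted(documents, key=similarity, reverse=True)[:k]

    # 2. Augment: place the retrieved passages into the prompt as context.
    context = "\n\n".join(top_docs)
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

    # 3. Generate: the model grounds its answer in the retrieved context.
    return llm(prompt)
```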
Although RAG introduces dynamic context retrieval, its capabilities remain passive and limited to information fetching, without the capacity for action or computation. To evolve past these limitations, external tool integration equips LLMs with the ability to interact with external programs through a variety of means. Zhuang et al. (2023) present their ToolQA benchmark, which they use to compare both standard and tool-augmented LLMs. The benchmark contains questions from multiple contextual dimensions, such as mathematical, scientific and social, and multiple tools are integrated into the tool-augmented LLMs to help with these tasks. As expected, the tool-augmented LLMs had a significantly higher success rate in the benchmark due to their access to external data, but they also tended to use incorrect data sources and tool calls, an issue which could potentially be addressed with fine-tuning.

AI agents capable of autonomous tool use and interaction with data sources have recently gained significant traction, further accelerated by OpenAI's function calling ability, which allows LLMs to call external APIs (Hou et al., 2025; OpenAI, 2023). The ability to retrieve real-time data, perform computations and interact with external systems expanded the capabilities of LLMs. The next step was a general-purpose protocol for standardized AI-tool interactions: the Model Context Protocol (MCP), introduced by Anthropic in late 2024 (Hou et al., 2025; Anthropic, 2024). MCP allows AI applications and external tools to communicate dynamically and enables AI agents to autonomously choose and operate tools based on the task at hand. This is in contrast to traditional implementations, where APIs were wired manually for each service and tool the AI application interacted with, as seen in Figure 8. MCP also improves on the abilities of RAG by enabling models to interact with tools and external data actively, allowing for both retrieval and action in a unified workflow (Hou et al., 2025).

Figure 8. Tool use with and without MCP compared (Hou et al., 2025).

Alongside function calling, increasing work is being put into creating autonomous AI agents that can complete human tasks by navigating and using user interfaces (UIs). While a lot of research has focused on enabling AI agents to navigate and understand UI structures, there has been a gap in understanding the real-world impacts of their actions. By studying and modelling the impacts of UI operations, developers can better anticipate and mitigate potential issues that come from AI interactions with UIs (Zhang et al., 2024). In their study, Zhang et al. (2024) concluded that current datasets do not adequately represent the complexities of UI action interactions, especially ones with major consequences, nor can top LLMs consistently understand the impacts of UI actions. While AI capable of navigating UIs made for humans is a big leap for truly autonomous AI systems, function calls and sandboxes might be the safer option for now.

As the scale and costs grow, the enhancements in LLMs start showing diminishing returns (Metz et al., 2024). This has driven the search for alternative ways to improve model performance, such as self-refinement methods that give LLMs the ability to review and enhance their own outputs post-training. Gou et al. (2023) demonstrate how integrating interactive tools within the LLM framework creates a feedback loop in which the model can identify and correct errors in its own responses. By integrating tool-based critiques, the LLM not only fixes inaccuracies but also refines its own generation process, leading to more reliable and contextually accurate outcomes. With a similar strategy, Madaan et al. (2023) present an alternative approach, where the LLM systematically re-evaluates and revises its responses. This self-feedback loop allows the model to enhance the clarity and accuracy of its output through each iteration. By setting its own initial output as the baseline for improvement, the method mitigates the issues that come from the inherent limitations of dataset training and compensates for the diminishing returns observed during pre-training.
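A schematic version of such a self-refinement loop, in the spirit of Madaan et al. (2023), is sketched below; the llm call and the prompt wording are hypothetical simplifications, and the original method also defines task-specific feedback formats and stopping criteria:

```python
# Schematic self-refinement loop in the spirit of Self-Refine (Madaan et
# al., 2023). `llm` is a hypothetical text-in/text-out model call.
def self_refine(task, llm, max_iters=3):
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_iters):
        # The model critiques its own previous output...
        feedback = llm(f"Task: {task}\nDraft answer:\n{answer}\n"
                       "Give concise feedback on errors or omissions, "
                       "or reply DONE if no changes are needed.")
        if feedback.strip() == "DONE":
            break
        # ...and then revises the draft using that feedback, so the
        # initial output serves as the baseline for improvement.
        answer = llm(f"Task: {task}\nDraft answer:\n{answer}\n"
                     f"Feedback:\n{feedback}\nRevised answer:")
    return answer
```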
4 Discussion

This chapter critically evaluates the post-training enhancement techniques and external tool integration approaches presented in chapter 3. The analysis focuses on identifying the strengths and weaknesses of each method, the issues that they solve and how they should be implemented.

4.1 Post-training techniques analysis

Supervised fine-tuning is useful for improving task-specific performance by fine-tuning a base model on annotated data. This is essential for initial alignment and performance optimization. Reinforcement learning, especially RLHF, allows models to be safer and aligned with human preferences. However, RL requires carefully designed reward functions and is costly to implement. RLAIF is a low-cost alternative to RLHF, with ever-increasing potential as LLMs evolve, allowing for better AI feedback. DPO further simplifies alignment by training directly on preference data without an explicit reward model, improving computational efficiency and stability. Direct language model alignment from online AI feedback (OAIF) extends these ideas by generating feedback for the model's own outputs in an online loop, combining the benefits of RLHF and DAP methods. Test-time scaling boosts performance by allocating more compute to the difficult prompts that need it. This improves efficiency, but it does not correct the inherent issues of LLMs. Retrieval-augmented generation pulls documents from external datasets, helping reduce hallucination and improve factual grounding.

4.2 External tool and information integration

Connecting LLMs to external tools (calculators, APIs, databases) extends capabilities beyond text generation, enabling precise computation, real-time data access and actions in software environments (Zhuang et al., 2023). Ensuring robust tool use requires careful fine-tuning on tool-calling patterns and monitoring of execution results. Standardized protocols such as the Model Context Protocol can streamline integration. While external tools greatly expand functionality and accuracy for specific tasks, they demand engineering effort to build reliable pipelines and guardrails against misuse or unexpected behaviour.

4.3 Integration strategies

The main insight from analysing these enhancement techniques is that they address different limitations of LLMs and are most effective when combined. A comprehensive AI system might include multiple techniques layered on top of each other, as seen in Table 1.

Table 1. Enhancement techniques divided into layers based on their function.

Layer        Techniques                  Function
Foundation   SFT                         Provides task-specific alignment
Alignment    RLHF, RLAIF, DPO, OAIF      Ensures safety and human preference alignment
Knowledge    RAG                         Incorporates external information when relevant
Capability   Tool integration, MCP       Enables computation and action beyond text generation
Quality      Self-refinement             Provides ongoing output refinement
Efficiency   Test-time scaling           Allocates computational resources based on task complexity

An advanced system might use supervised fine-tuning to specialize the model for a task, align the model with RLHF, RLAIF, OAIF or DPO, use RAG to add context to the prompt if external data is needed, generate an answer and finally apply a self-refinement loop to verify the output, as sketched below. Frameworks such as MCP might also be integrated so that the model can better process and retrieve data through tools. In this example pipeline, each layer addresses a different weakness of LLMs: fine-tuning aligns the model to the desired purpose, the alignment techniques ensure the answer is safe and in the expected format, RAG fills in the missing context, MCP allows tool calls and the self-refinement loop refines the output.
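As an illustration of this layering, the sketch below composes hypothetical components in that order; every name here (the fine-tuned model, retriever, tool registry and refiner) is a placeholder, and the point is the order of composition rather than any concrete API:

```python
# Schematic pipeline combining the layers of Table 1. All components are
# hypothetical placeholders standing in for real implementations.
def answer_query(query, model, retrieve, tools, refine):
    # Knowledge layer: fetch external context relevant to the query (RAG).
    context = retrieve(query)

    # Foundation and alignment layers are baked into `model`, which is
    # assumed to have been fine-tuned (SFT) and preference-aligned
    # (RLHF/RLAIF/DPO/OAIF) offline before deployment.
    draft = model(query, context=context)

    # Capability layer: let the model invoke tools (e.g. via MCP) for
    # computation or actions that plain text generation cannot perform.
    while draft.wants_tool_call:
        result = tools[draft.tool_name](**draft.tool_args)
        draft = model(query, context=context, tool_result=result)

    # Quality layer: a self-refinement pass verifies and polishes output.
    return refine(query, draft.text)
```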
4.4 Practical considerations and trade-offs

Implementing these techniques in production systems requires careful consideration of various trade-offs. While techniques like test-time scaling and self-refinement can improve output quality, they significantly increase computational costs and latency. Organizations must balance quality improvements against operational constraints. Integrating multiple enhancement techniques increases system complexity, potentially introducing new failure modes and making debugging more challenging. Many techniques require high-quality training data or external knowledge bases, and the cost and effort of maintaining these resources must be considered in implementation decisions. External tool integration and autonomous capabilities introduce new security considerations, making proper sandboxing, access controls and monitoring mechanisms essential. As systems become more complex, evaluating their performance becomes more difficult, and traditional benchmarks may not capture the full range of capabilities and potential failure modes.

5 Conclusion

As large language models (LLMs) reach the limits of what can be achieved through pre-training, post-training techniques and external tool integration have emerged as vital strategies for continued advancement. This thesis has explored several methods for enhancing LLMs past their initial training, highlighting the importance of fine-tuning, reinforcement learning, test-time scaling and retrieval-augmented generation (RAG). Additionally, it examined how external tools and protocols like the Model Context Protocol (MCP) can enhance LLMs with capabilities that go beyond static knowledge, enabling dynamic interaction with data and software environments.

Each of the explored techniques targets different weaknesses inherent in pre-trained LLMs. Fine-tuning increases task-specific alignment, reinforcement learning ensures alignment with human preferences, RAG improves factual grounding and test-time scaling allows for more efficient resource management. External tools significantly extend the possibilities of LLMs by enabling real-time computation, information retrieval and interaction with the outside world. These approaches are most effective when used in combination, creating multi-layered systems that can compensate for individual limitations and reinforce each other's strengths. However, multiple challenges remain. The computational cost of combining several enhancement techniques raises questions about practical scalability and cost-effectiveness, and security and safety considerations rise in importance as models gain autonomy through tool integration.

Large language models are a relatively new development in the field of NLP, and thus research on the post-training enhancement of LLMs has substantial room for development. Several open questions remain for future research. For one, research is required on how different enhancement techniques can be combined most effectively for specific applications. Research is also needed on the practical scalability limits of techniques such as test-time scaling and self-refinement. As LLM systems become more capable, especially with the introduction of API calling and agentic capabilities, how can it be ensured that they remain aligned and within safe boundaries? Lastly, research is needed on how the computational cost of these enhancement techniques can be minimized while preserving their benefits. The fast-moving development of this field suggests that new technologies and approaches will continue to emerge, requiring ongoing research and development efforts to understand their implications and optimal application methods.

The path forward for LLM development lies not only in scaling up model sizes, but in developing smarter post-training systems that integrate alignment, self-refinement, test-time scaling, external information, external tools and agency. By understanding and combining these techniques, researchers and developers can push the performance and reliability of LLMs to new heights, making them more efficient, accurate, adaptable and aligned with human goals.

References

Anthropic. (2024, November 25). Introducing the Model Context Protocol. Anthropic News. Retrieved June 9, 2025, from https://www.anthropic.com/news/model-context-protocol

Bommasani, R., et al. (2021, August 16). On the Opportunities and Risks of Foundation Models. arXiv. Retrieved May 4, 2025, from https://arxiv.org/abs/2108.07258

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021, October 26). Training Verifiers to Solve Math Word Problems. arXiv. Retrieved June 12, 2025, from https://arxiv.org/abs/2110.14168

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., & Wang, H. (2023, December 18). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv. Retrieved May 7, 2025, from https://simg.baai.ac.cn/paperfile/25a43194-c74c-4cd3-b60f-0a1f27f8b8af.pdf
Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N., & Chen, W. (2023, May 19). CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. arXiv. Retrieved May 4, 2025, from https://arxiv.org/abs/2305.11738

Guo, S., Zhang, B., Liu, T., Liu, T., Khalman, M., Llinares, F., Ramé, A., Mesnard, T., Zhao, Y., Piot, B., et al. (2024, February 29). Direct language model alignment from online AI feedback. arXiv. Retrieved June 9, 2025, from https://arxiv.org/pdf/2402.04792

Holdsworth, J., & Scapicchio, M. (2024, June 17). What is deep learning? IBM THINK. Retrieved May 4, 2025, from https://www.ibm.com/think/topics/deep-learning

Hou, X., Zhao, Y., Wang, S., & Wang, H. (2025, March). Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. arXiv. Retrieved June 9, 2025, from https://arxiv.org/pdf/2503.23278

IBM. (2023, November 2). What are large language models (LLMs)? IBM THINK. Retrieved May 7, 2025, from https://www.ibm.com/think/topics/large-language-models

Józefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016, February 7). Exploring the limits of language modeling. arXiv. Retrieved May 7, 2025, from https://arxiv.org/abs/1602.02410

Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996, May 1). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research. Retrieved May 7, 2025, from https://www.jair.org/index.php/jair/article/view/10166

Komeili, M., Shuster, K., & Weston, J. (2021, July 15). Internet-Augmented Dialogue Generation. arXiv. Retrieved May 7, 2025, from https://arxiv.org/abs/2107.07566

Kumar, K., Ashraf, T., Thawakar, O., Anwer, R. M., Cholakkal, H., Shah, M., Yang, M.-H., Torr, P. H. S., Khan, F. S., & Khan, S. (2025, March 24). LLM Post-Training: A Deep Dive into Reasoning Large Language Models. arXiv. Retrieved May 7, 2025, from https://arxiv.org/abs/2502.21321

LeCun, Y., Bengio, Y., & Hinton, G. (2015, May 27). Deep learning. Nature. Retrieved May 4, 2025, from https://www.nature.com/articles/nature14539

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023, May 31). Let's Verify Step by Step. arXiv. Retrieved June 12, 2025, from https://arxiv.org/abs/2305.20050

Madaan, A., et al. (2023, March 30). Self-Refine: Iterative Refinement with Self-Feedback. arXiv. Retrieved May 4, 2025, from https://arxiv.org/abs/2303.17651

Metz, R., Ghaffary, S., Bass, D., & Love, J. (2024, November 13). OpenAI, Google and Anthropic Struggle to Build More Advanced AI. Bloomberg Law. Retrieved May 4, 2025, from https://www.bloomberg.com/news/articles/2024-11-13/openai-google-and-anthropic-are-struggling-to-build-more-advanced-ai

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candes, E., & Hashimoto, T. (2025, March 1). s1: Simple test-time scaling. arXiv. Retrieved June 9, 2025, from https://arxiv.org/pdf/2501.19393
Murphy, M. (2022, May 9). What are foundation models? IBM Research Blog. Retrieved May 7, 2025, from https://research.ibm.com/blog/what-are-foundation-models

OpenAI. (2024, September 12). Learning to reason with LLMs. OpenAI. Retrieved June 9, 2025, from https://openai.com/index/learning-to-reason-with-llms/

OpenAI. (n.d.). Function calling. OpenAI Developer Documentation. Retrieved June 9, 2025, from https://platform.openai.com/docs/guides/function-calling?api-mode=responses

Prompt Engineering Guide. (n.d.). Retrieval Augmented Generation (RAG). Retrieved June 9, 2025, from https://www.promptingguide.ai/research/rag

Raschka, S. (2024, September 15). dpo-from-scratch.ipynb. GitHub. Retrieved May 7, 2025, from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb

Snell, C., Lee, J., Xu, K., & Kumar, A. (2024, August 6). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv. Retrieved May 7, 2025, from https://arxiv.org/abs/2408.03314

Stryker, C., & Bergmann, D. (2025, March 28). What is a transformer model? IBM THINK. Retrieved May 7, 2025, from https://www.ibm.com/think/topics/transformer-model

Stryker, C., & Holdsworth, J. (2024, August 11). What is NLP (natural language processing)? IBM THINK. Retrieved May 4, 2025, from https://www.ibm.com/think/topics/natural-language-processing

Zhang, Z. J., Schoop, E., Nichols, J., Mahajan, A., & Swearngin, A. (2024, October). From Interaction to Impact: Towards Safer AI Agents Through Understanding and Evaluating Mobile UI Operation Impacts. arXiv. Retrieved May 7, 2025, from https://arxiv.org/abs/2410.09006

Zhuang, Y., Yu, Y., Wang, K., Sun, H., & Zhang, C. (2023). ToolQA: A Dataset for LLM Question Answering with External Tools. NeurIPS Proceedings.
Retrieved May 7, 2025, from https://proceedings.neurips.cc/paper_files/paper/2023/hash/9cb2a7495900f8b602cb10159246a016-Abstract-Datasets_and_Benchmarks.html