Kim Lehtinen 

Scaling a Kubernetes Cluster 

 
Vaasa 2022 

School of Technology and Innovations  
Master’s thesis in Automation and 

Computer Science  


2 

UNIVERSITY OF VAASA 
School of Technology and Innovations 
Author:    Kim Lehtinen 
Title of the Thesis:  Scaling a Kubernetes Cluster 
Degree:    Master of Science in Technology 
Programme:   Automation and Computer Science 
Supervisor:   Prof. Timo Mantere 
Instructor:   Prof. Timo Mantere 
                M.Sc. Mika Filander 
Year:    2022                                                                                     Pages: 73 

ABSTRACT: 
Kubernetes is a container orchestration tool that has become widely adopted for deploying and 
scaling containers. Devatus Oy as well as their subsidiary company Fliq Oy are interested in 
knowing how containerized applications can be scaled on Kubernetes. The objective of this the-
sis is to research how a Kubernetes cluster can be scaled as well as containerized applications 
running on Kubernetes. 
 
This thesis begins with an introduction to necessary background knowledge needed to under-
stand what Kubernetes is. Cloud computing and distributed systems are introduced, since Ku-
bernetes is a distributed system used in cloud environments for the most part. Furthermore, 
distributed applications and workloads are introduced through the concept of microservices. 
The concept of containerizing applications is thoroughly introduced to understand the runtime 
environment of the applications deployed to Kubernetes. Finally, Kubernetes architecture as 
well as its main components are introduced to understand how container orchestration works. 
 
The research on Kubernetes scalability is divided into three different parts. First part consists of 
researching how containerized applications can be scaled on Kubernetes. Second part is focused 
on how the Kubernetes cluster itself can be scaled. The final part consists of load testing one of 
Fliq’s example REST API applications deployed to a local Kubernetes cluster. The purpose of load 
testing is to gain further insight into scaling applications running on Kubernetes. Load test results 
are compared between the initial deployment configurations and after scaling the application. 
 
The load test results show that containerized applications can be scaled both vertically and hor-
izontally. Vertical scaling can be achieved by increasing the requested and limited CPU and RAM 
resources for a Pod. Horizontal scaling can be achieved by increasing Pod replicas as well as 
having a Service in front of the Pods that load balances the incoming traffic. Load test results 
show that both vertical and horizontal scaling can increase the number of users supported by 
an application deployed to Kubernetes. Scaling horizontally is preferred for Fliq’s example REST 
API since it decreased average response time and increased throughput. 
 

KEYWORDS: Kubernetes, Cloud Native, Cloud Computing, Scalability, Load testing 
  

3 

Contents 

1 Introduction 8 

1.1 Project founders 8 

1.2 Objective 8 

1.3 Structure 10 

2 Cloud Computing 11 

2.1 Introduction to Cloud Computing 11 

2.2 Distributed Systems 13 

2.3 REST API 13 

2.4 Performance testing 14 

2.4.1 Performance and load testing 14 

2.4.2 JMeter 15 

3 Cloud Native Computing 16 

3.1 Microservices 17 

3.2 Container Technology 20 

3.2.1 Introduction to containers 21 

3.2.2 Containers vs Virtual Machines 22 

3.2.3 Container image 23 

3.2.4 Container registry 24 

3.3 Kubernetes 25 

3.3.1 Pod 28 

3.3.2 Deployment 29 

3.3.3 Service 30 

3.3.4 Ingress 31 

3.3.5 Kubectl 32 

4 Scaling a Kubernetes Cluster 35 

4.1 Monitoring 35 

4.1.1 Metrics Server 35 

4.1.2 Prometheus 35 


4 

4.1.3 Grafana 36 

4.1.4 Helm 37 

4.2 Horizontal pod scaling 37 

4.3 Vertical pod scaling 39 

4.4 Cluster Autoscaler 41 

5 Test environment 44 

5.1 Cluster architecture 44 

5.2 Example application 45 

6 Scalability testing 48 

6.1 Objective 48 

6.2 Load testing 48 

6.2.1 JMeter setup 49 

6.2.2 Initial test 50 

6.2.3 Vertical scaling 53 

6.2.4 Horizontal scaling 57 

6.3 Evaluation 61 

7 Conclusion 64 

7.1 Scaling a Kubernetes cluster 64 

7.2 Limitations and future research 66 

References 68 

 
5 

Figures 
 

Figure 1. Techniques used to deploy cloud infrastructure 12 

Figure 2. Microservices 19 

Figure 3. Microservices in distributed systems 20 

Figure 4. Container vs VM 22 

Figure 5. Docker image and container flow 25 

Figure 6. Kubernetes components 27 

Figure 7. Ingress 32 

Figure 8. Horizontal Pod Autoscaler 38 

Figure 9. Cluster Autoscaler 42 

Figure 10. Cluster architecture diagram 45 

Figure 11. API endpoint used for load testing 49 

Figure 12. Concurrency thread group example 50 

Figure 13. Kubernetes resources used for initial test 51 

Figure 14. Pod details after running out of memory 52 

Figure 15. Max CPU usage for the initial load test 52 

Figure 16. Max RAM usage initial test 53 

Figure 17. Max CPU usage vertical scaling test 56 

Figure 18. Max RAM usage vertical scaling test 56 

Figure 19. Pod instances for horizontal scaling 58 

Figure 20. Max CPU usage horizontal scaling test 59 

Figure 21. Max RAM usage horizontal scaling test 60 

Figure 22. HTTP error % for initial, vertical, and horizontal scaling tests 61 

Figure 23. Average response time for initial, vertical, and horizontal scaling tests 62 

Figure 24. Throughput for initial, vertical, and horizontal scaling tests 63 


6 

  
Tables 
 

Table 1. JMeter results for initial test 51 

Table 2. JMeter results for vertical scaling test 55 

Table 3. JMeter results for horizontal scaling test 58 

 
Abbreviations 
 
API   Application Programming Interface 
CNCF   Cloud Native Computing Foundation 
CLI   Command Line Interface 
CPU   Central Processing Unit 
CRI   Container Runtime Interface 
HPA   Horizontal Pod Autoscaler 
HTTP   Hypertext Transfer Protocol 
HTTPS  Hypertext Transfer Protocol Secure 
OCI   Open Container Initiative 
OS   Operating System 
RAM   Random Access Memory 
REST   Representational State Transfer 
SQL   Structured Query Language 
VM   Virtual Machine 
VPA   Vertical Pod Autoscaler 
YAML   YAML Ain’t Markup Language 


7 

Acknowledgement 

To begin with, I would like to thank Devatus and Fliq for an interesting thesis project. I 

have learned a lot about scaling applications and container technologies. These are val-

uable skills that will be useful for my career. 

 
This thesis project would not have been possible without the people I have around me. 

I want to thank my thesis instructor and boss Mika Filander for making this thesis project 

happen and for always supporting me. Finally, I want to thank my fiancée, family, and 

friends for always being there for me. 

 
Kim Lehtinen 

Vaasa, 28.2.2022  


8 

1 Introduction 

The use of Cloud Native technologies has increased over the last few years. Instead of 

building monolithic applications and using virtual machines, companies are moving to-

wards container technologies and distributed systems. One of these technologies is Ku-

bernetes, a container orchestration tool designed to run containerized applications at 

scale. The purpose of this thesis is to study how an application running on Kubernetes 

can be scaled. 

 
1.1 Project founders 

This thesis project is done for both Devatus Oy and their subsidiary company Fliq Oy. 

Devatus Oy is a software development company, that specializes in developing digital 

services for industrial companies (Devatus 2020). In addition to digital service develop-

ment, they offer cloud solutions, data analytics and IoT solutions. 

 
Fliq Oy, on the other hand, is a software development company that specializes in smart 

factory solutions for industrial companies (Fliq 2020). Fliq Oy offers a smart factory prod-

uct called Fliq, which is a cloud-based product for companies to follow up on industrial 

processes and visualizing data that is gathered from IoT sensors. 

 
1.2 Objective 

Fliq Oy has recently moved from traditional monolithic applications to splitting their ap-

plications into smaller services, called microservices. In addition, they have chosen to 

package and run these microservices inside containers. Furthermore, they decided to 

manage and orchestrate their containers using Kubernetes.  

 
9 

Devatus Oy and Fliq Oy are interested in how to scale Kubernetes clusters. The purpose 

of this research is to understand how applications running on Kubernetes can be scaled 

and find possible defects and challenges when running applications on Kubernetes. This 

is done by studying different methods for scaling applications running on Kubernetes 

and how the cluster itself can be scaled. In addition, load testing is performed against an 

example application running in a test cluster to get a better understanding of scaling 

applications on Kubernetes. 

 
This thesis begins with an introduction to cloud computing, since Kubernetes is a tech-

nology used in cloud. Kubernetes clusters can consist of distributed servers and applica-

tions, which is why distributed systems are also introduced. REST APIs are introduced 

shortly since load tests are performed against an example REST API application running 

on Kubernetes. The concept of load testing is finally introduced to understand how it can 

be used to test the scalability of applications running on Kubernetes.  

 
Before investigating Kubernetes scalability, one must first understand core concepts of a 

technology like Kubernetes. This done with an introduction to several cloud native com-

puting concepts like microservices, container technologies and container orchestration. 

In addition, key concepts in Kubernetes are explored to further understand the technol-

ogy. 

 
In order to understand how a Kubernetes cluster can be scaled, several scaling methods 

are researched. To begin with, Kubernetes monitoring is researched to understand how 

resource metrics can be retrieved from applications running on Kubernetes. Secondly, 

scalability methods on the application level are researched to understand how applica-

tions can be scaled when running on Kubernetes. Finally, the scaling of the Kubernetes 

cluster itself is studied to understand how cluster resources can be scaled. 

 
To gain even further insight into how a Kubernetes cluster can be scaled, load tests are 

performed against an example application running in a local Kubernetes cluster. This is 


10 

achieved by first creating a local Kubernetes cluster which is used as test environment. 

Load tests are performed against an example application that is deployed to the local 

cluster. After the initial load test, scalability methods are applied, and new tests are ex-

ecuted. The load test results are analyzed to see if the scalability methods work. In addi-

tion, future ideas for scaling the application are discussed based on the scalability meth-

ods researched in this thesis and load testing results. 

 
1.3 Structure 

The second chapter of this thesis is an introduction to distributed cloud computing, 

which consists of theoretical background to understand cloud computing, distributed 

systems, REST APIs and performance testing. Chapter 3 introduces Cloud Native Compu-

ting, which consists of introduction to microservices, container technologies and Kuber-

netes. In chapter 4, the research of scaling a Kubernetes takes place. In chapter 5 the 

test environment is shown. Chapter 6 is scalability testing of an example application run-

ning on a local Kubernetes cluster. The final chapter is the thesis conclusion. 


11 

2 Cloud Computing 

Since the focus of this thesis is scalability testing of a Kubernetes cluster, one must first 

understand the building blocks of cloud computing and distributed systems. This chapter 

is an introduction to cloud computing, distributed systems, REST APIs and performance 

testing. Information presented in this chapter is also important to understand the next 

chapter, where cloud native computing is introduced. 

 
2.1 Introduction to Cloud Computing 

Cloud computing is a term heard often these days in the IT sector.  While it may sound 

like new thing, (Wang, Ranjan, Chen, & Benatallah 2011: 4–5) states that cloud compu-

ting is based on older ideas of computing, and historical changes in society. There was a 

time in history when people came up with the idea that they can create a business where 

they provide services and resources that people need and do not want to maintain them-

selves such as electricity power Wang says. Cloud computing is an effect of the same 

kind of revolution, computing power and resources can now be sold and distributed in 

the same way. 

 
According to Wang (2011: 4–5), sharing computer power and resources is also not a 

completely new invention. If one looks at the evolution of computing, the earlier ver-

sions of computers were shared among users before personal computers (PC) were in-

vented. After the Internet revolution, it made sense again to share computing resources 

to PCs and mobile phones via the Internet. 

 
Marinescu (2013: 1) says that cloud computing offers computing power and resources 

via the Internet to users in a flexible way. A user of cloud computing only has to pay for 

what is actually used. Marinescu also states that cloud computing is a successor of utility 

computing, which introduced the business idea and model of sharing computing 


12 

resources to users. The area of cloud computing began when big companies started of-

fering these kinds of services to users. 

 
Cloud computing infrastructure can be deployed in several ways. Wang (2011: 11–13) 

explains three different techniques used to deploy cloud infrastructure. The most popu-

lar technique is public cloud, which means that instead of doing cloud computing oneself, 

cloud computing is provided by other companies through the public internet, and thus 

anybody can use these resources as needed in exchange for money. The opposite to this 

is private cloud, where the cloud computing infrastructure is internal and not available 

for anyone to use or buy. Last technique described by Wang is hybrid cloud, which is a 

combination of public and private cloud, where one can choose what part of the cloud 

computing infrastructure is exposed to the public internet, and what part should remain 

private. These deployment techniques described by Wang are demonstrated in figure 1. 

 
Figure 1. Techniques used to deploy cloud infrastructure (Based on Wang 2011: 12) 

 
In addition to deployment techniques, Wang (2011: 13–14) says that there are three 

popular types of cloud services: Infrastructure as a Service (IaaS), Software as a Service 

(SaaS) and Platform as a Service (PaaS). What is common between these services is that 

users of these services only pay for what they use, and these services offer different 

types of cloud computing resources. According to Wang, IaaS offer infrastructure ser-

vices like storage or virtual machines. PaaS on the other hand, is built on top of IaaS and 

works as a platform that can be used to create new products and applications, by 


13 

providing cloud services like software testing, application deployment, databases etc. 

SaaS is software that is accessible via the Internet, usually in a web browser Wang says. 

 
2.2 Distributed Systems 

Distributed systems come in different shapes, layers and have various definitions. These 

terms are often heard in IT today, especially cloud computing. In this chapter, distributed 

systems are introduced both at hardware and software levels of computing. 

 
Marinescu (2013: 27) describes distributed systems as a set of interconnected comput-

ers. A software layer called “middleware” is used to connect computers to each other by 

exposing a network channel interface. Middleware is what glues these computers to-

gether and allows computers to share computing resources. Furthermore, Marinescu 

mentions that common characteristics of middleware are system scalability, information 

sharing, concurrency, and information accessibility. 

 
According to Puder, Römer, Pilhofer, & Romer (2005: 8), a distributed system can also be 

seen as interconnected processes, in addition to computers. A distributed system run-

ning several computer processes can run either on the same machine or multiple ma-

chines. No matter if a distributed system consists of computers or processes, they share 

a similar model where a set of nodes are connected and can communicate with each 

other. 

 
2.3 REST API 

In this thesis, an example application is used to test Kubernetes scaling methods. This 

application is deployed to a local Kubernetes cluster where it is load tested. This chapter 

explains what a REST API is since the example application is of this type. 

 
14 

There are numerous ways to design and architect applications. Applications often pro-

vide an interface through which they can be used by other systems, called Application 

Programming Interface (API) (IBM 2020). One common way to architect web-based APIs 

is to use Representational State Transfer (REST). 

 
REST APIs are web-based applications that can be consumed through HTTP/HTTPS pro-

tocol. These applications provide resources that can be accessed through an exposed API 

by following a set of rules. REST APIs follow the client-server model, where the client can 

access the API by sending HTTP requests including a URI for a specific resource that the 

API exposes. (Kanjilal 2013: 24–25). 

 
REST APIs support a set of HTTP methods. For retrieving resources, the GET method is 

usually used. When new resources are created, the POST method is used. For modifying 

resources, the PUT method is preferred. DELETE method can be used to remove re-

sources. Finally, the HEAD method can be used for accessing HTTP headers. (Kanjilal 2013: 

26). 

 
2.4 Performance testing 

Load testing is used in this thesis for testing Kubernetes scalability methods. This chapter 

introduces what performance and load testing is. In addition, JMeter which is the load 

testing tool used in this thesis is introduced. 

 
2.4.1 Performance and load testing 

Performance testing can be used to find out system performance. The system being 

tested can for example be an application, server, or network. The tests can be performed 

on a complete system or parts of it. The benefit from doing performance testing is that 


15 

it can be used to check if the system under test is able to operate in different conditions 

(Erinle 2013: 23). 

 
Load testing is one type of performance testing. It can be used to test how much load 

can be applied on the system under test (Erinle 2013: 29). If the system is an application, 

load testing can for example be used to find out the maximum number of users it is able 

to support (BlazeMeter 2019). 

 
2.4.2 JMeter 

JMeter is an open-source performance testing tool. It was first created in 1998 by The 

Apache Software Foundation. JMeter can be used to test several application types, for 

example web applications, databases, or email. Being multithreaded, JMeter is able to 

create test scenarios for high user load. (Erinle 2013: 30). 

 
New features can be added to JMeter by installing plugins. One plugin used in this thesis 

is Concurrency Thread Group. It can be used to setup concurrent threads for testing user 

load (BlazeMeter, 2016). This plugin is used in this thesis to do load tests that simulate 

high user load. 


16 

3 Cloud Native Computing 

Cloud Native Computing Foundation (CNCF) is a foundation that originated from the 

Linux Foundation, with the intention of supporting open-source cloud native projects 

(CNCF 2021a). These projects go through different stages of maturity, to help companies 

find suitable solutions (CNCF 2021b). CNCF organize conferences around the world, to 

help build a cloud native community that connects companies, software developers and 

users of the projects that CNCF supports (CNCF 2021a). In this research, several cloud 

native technologies are used that are supported by CNCF. 

 
Cloud native computing has over the recent years become popular within the cloud com-

puting landscape. The term cloud native itself has numerous definitions and can there-

fore confuse people. CNCF (2018) have their own definition for what cloud native is, and 

part of it is shown below. 

 
Cloud native technologies empower organizations to build and run scalable 
applications in modern, dynamic environments such as public, private, and hybrid 
clouds. Containers, service meshes, microservices, immutable infrastructure, and 
declarative APIs exemplify this approach. 
These techniques enable loosely coupled systems that are resilient, manageable, 
and observable. Combined with robust automation, they allow engineers to make 
high-impact changes frequently and predictably with minimal toil. 

 
To conclude, cloud native technologies are technologies that encourage its users to de-

ploy distributed software to the cloud in a fast-changing environment (CNCF 2018). 

CNCF’s definition shows that cloud native computing has derived from cloud computing, 

by encouraging features of cloud computing, for example, deployment techniques and 

latest innovations within cloud computing. Cloud native computing focuses more on get-

ting the best out of cloud computing. 

 
The rest of this chapter goes over the most basic concepts and technologies of cloud 

native technologies. First, the idea of microservices is introduced to get a better under-

standing of distributed software. Secondly, containers are introduced to understand how 


17 

to package and run software in cloud native environments. Finally, container orchestra-

tion and Kubernetes are introduced since these are core concepts of understanding the 

rest of this research. The knowledge presented in this chapter is used in later chapters 

to understand how Kubernetes clusters can be scaled. 

 
3.1 Microservices 

In this chapter, microservices architecture model is introduced. This architecture model 

aligns with the cloud native philosophy: applications should be easy to distribute and 

decoupled (CNCF 2018). This chapter compares monolithic applications with micro-

services and shows why microservices are suitable for scalable and distributed systems. 

 
For a long time, companies have faced the difficulty of scaling software and cloud infra-

structure due to the advancement of digitalization. In addition, the maintenance of 

source code has also become challenging due to applications growing larger. As a result 

of many attempts and different solutions to scale software, microservices has derived as 

an alternative software architectural design. (Newman 2015: 1–2). 

 
Traditionally, a software system is usually a monolithic application. In this simple archi-

tecture, a software system is one application only, containing all code and functionalities 

for that system. Large monolithic web services often consist of the following components 

all in the same package: frontend (web client), backend (web server) and database. 

(MuleSoft 2020). 

 
The drawback of monolithic applications is that whenever a programmer makes a change 

to any layer of a monolithic application, the whole application has to be rebuilt and re-

leased whenever a new deployment is made (MuleSoft 2020). This aligns with Newman 

saying that source code is getting harder to maintain. When all of the source code is 

found in one place only, and the application consists of many components and layers in 

the same binary, the project is harder to maintain in the long run. 


18 

 
The goal of microservices is to scale software by dividing it into smaller services, where 

each service is designed for a specific purpose or task, hence the name microservices. 

The services can for example be divided by task or business functionality in order to give 

each service a meaningful purpose. This type of division makes sure that the size of a 

microservice stays small compared to a monolithic application. (Newman 2015: 1–2). 

 
Microservices are typically APIs by design. As a result, they can work together by sending 

requests to each other. This minimalistic and practical design makes microservices ideal 

for distributed systems, since they can be deployed separately and still manage to work 

together. (Newman 2015: 3). 

 
A great advantage that comes with microservices is that the same technology doesn't 

have to be used everywhere or solve all problems. Each service can be implemented with 

the technology that is best suited to solve a specific problem. The services can still com-

municate with each other as long as they continue to communicate through their ex-

posed APIs. (Newman 2015: 4). 

 
In figure 2 an example based on one of Newman's (2015: 4) examples is demonstrated. 

Here three different microservices are shown: Posts, Users and Pictures. This could for 

example be a social media application where users can write posts and upload pictures. 

Each service is implemented with different programming language, and they use differ-

ent databases for storage. In this example, Golang was best suited for handling user re-

lations together with a SQL database. For storing posts, the combination of a Node.js API 

server and NoSQL database was the most optimal solution, and for images a Java appli-

cation paired with a blob storage was a good solution. The point of this example is to 

show that microservices open the possibilities for selecting the best technology to solve 

a specific problem (Newman 2015: 4). In addition, this example shows how a monolithic 

application can be split into microservices where each service is designed for a specific 

task, and still manage to work together. 

 
19 

 
Figure 2. Microservices (Based on Newman 2015: 4) 

 
When a monolithic application experiences a problem, all parts of that application suffer 

as a result (Newman 2015: 5). If the social media application shown in figure 5 would 

have been a monolithic application, and the “Pictures” module would experience severe 

problems, the application would not be able to handle user requests, posts, or pictures 

since the whole application is a single unit that is experiencing a problem. If the same 

application uses microservices, and the “Pictures” service experiences downtime, all 

other services are still usable. 

 
Microservices architecture is often used in scalable and distributed systems. In a mono-

lithic application one is not able to choose which part of the application should be scaled 

(Newman 2015: 5). If the social media application shown earlier was a monolith, the 

whole application would have to be scaled even if the "Posts" feature would be the only 

one experiencing performance issues. When using microservices, one is able to specify 

which service should be scaled (Newman 2015: 5). 

 
In figure 3 the scalability of microservices is demonstrated. This is the same application 

as shown in figure 2 where a social media application has three microservices. Here the 

number of instances of each service has been scaled accordingly. Microservices architec-

ture offers the possibility to replicate a specific service by deploying multiple instances. 

As a result, one can choose which part of a system has to be scaled and by how much. In 

figure 3, if the “Posts” service is the most demanding, it can be replicated three times 

for example. The “Pictures” service is the least demanding and only needs one instance. 


20 

 
Figure 3. Microservices in distributed systems (Based on Newman 2015: 6) 

 
3.2 Container Technology 

The focus of this thesis is to understand how a Kubernetes cluster can be scaled. Kuber-

netes is a platform for running and distributing applications inside containers, which is 

why one must first understand the basics of container technology before learning what 

Kubernetes is. This chapter introduces container technologies by comparing containers 

with virtual machines and the basics of Docker containers. Docker is one of the most 

popular container runtimes and often used together with Kubernetes. 

 
As of Kubernetes version 1.20, Docker as a container runtime on Kubernetes has been 

deprecated. The reason for this is that Docker provides a lot of features in addition to 

the runtime that are not necessary for Kubernetes. Docker was never intended to be 

integrated into Kubernetes, while other container runtimes are by implementing a Con-

tainer Runtime Interface (CRI). One of these is “containerd” which is the runtime that 

Docker uses under the hood. Therefore, Kubernetes decided to deprecate Docker since 

containerd can be used without Docker. (Kubernetes 2020a). 

 
21 

According to Kubernetes (2020a), even if Docker is not used as the container runtime, 

applications built using Docker will still run on Kubernetes since CRI-compliant container 

runtimes uses OCI (Open Container Initiative) images to run containerized applications. 

Applications built and packaged using Docker produces OCI images and will therefore 

run on Kubernetes (Kubernetes 2020a). 

  
Docker provides a good technology for packaging software into OCI-compliant container 

images and running containers locally. Therefore, Docker is used in thesis to explain con-

tainer technology concepts. Chapter 3.2.1 is an introduction to what containers are. In 

chapter 3.2.2 containers are compared with virtual machines (VM) in order to under-

stand the difference between these two popular virtualization technologies. 

 
3.2.1 Introduction to containers 

Containers became popular and more accessible in the last decade when container tech-

nologies like Docker were introduced (D2iQ 2018). However, the technology and idea of 

packaging and running software inside isolated containers is older (D2iQ 2018). Below is 

how Google Cloud (2022) describes what containers are. 

 
Containers offer a logical packaging mechanism in which applications can be 
abstracted from the environment in which they actually run. This decoupling allows 
container-based applications to be deployed easily and consistently, regardless of 
whether the target environment is a private data center, the public cloud, or even 
a developer’s personal laptop. 

 
According to Google Cloud’s (2022) definition of what containers are, container technol-

ogies offer a more universal way of packaging and running applications across environ-

ments. All of the source code and libraries that is needed to run an application can be 

put inside the container (Docker 2021a). To conclude, containers can be used to isolate 

and distribute software. 

 
22 

3.2.2 Containers vs Virtual Machines 

In this thesis both containers and virtual machines are used. Containers are used to run 

containerized applications on a Kubernetes cluster, and virtual machines are used to cre-

ate a local Kubernetes cluster for scalability testing in chapter 6. In this chapter these 

two virtualization technologies are compared to understand the difference between 

them. 

 
Figure 4 shows the difference between virtual machines and containers. Starting from 

the lowest level, both virtualization technologies are dependent on an underlying infra-

structure in the form of a computer together with an operating system running on it 

(Poulton 2020: 71). The first difference in VMs is that they are dependent on a hypervisor 

to virtualize physical resources (Poulton 2020: 73). Containers don't need a hypervisor 

since the virtualization happens on the operating system (OS) level (Poulton 2020: 73). 

The final difference between containers and VMs are that an OS has to be installed in 

each VM, while multiple containers can use the same host OS (Poulton 2020: 73). 

 
Figure 4. Container vs VM (Kubernetes 2021a). License: CC BY 4.0. 

 
Figure 4 also shows the benefit of separating applications from each other either using 

VMs or containers. Before these technologies, all applications were running on the same 

server. The only way to truly separate applications was to add more servers, which would 

result in unused resources. VMs solved this problem by creating several VMs on the 

https://github.com/kubernetes/website/blob/main/LICENSE


23 

same host. Containers are able to solve the same problem with less overhead, since con-

tainers can use host OS. (Kubernetes 2021a). 

 
Containers take less time to create due to VMs having to install an entire OS each time 

during creation (Poulton 2020: 74). The computer on which the containers will be run-

ning on has a running OS ready for use (Poulton 2020: 74). Poulton (2020) concludes 

containers as a more cost-effective solution followingly "You can pack more applications 

onto less resources, start them faster, and pay less in licensing and admin costs, as well 

as present less of an attack surface to the dark side" (Poulton 2020: 74). These benefits 

can make containers a compelling option when deciding how to run and scale software. 

 
3.2.3 Container image 

Container images are needed in order to create and run containers. They contain every-

thing that is needed to run an application in a container. It is common for container im-

ages to build upon other images called “base images”. (Microsoft 2021). 

 
When using Docker as the technology to create container images, a file named 

"Dockerfile" is used. This file can be used to tell Docker step by step how a docker image 

should be created. Each step in the Dockerfile can be thought of as a command to tell 

Docker what to do. (Docker 2021b). 

 
Below is an example of a Dockerfile code created by Docker (2021b). The first command 

is "FROM", which tells Docker to base the new container image upon ubuntu container 

image. The application source code is copied into the container image using the "COPY" 

command. The "RUN" command compiles the application code. Finally, the "CMD" 

command starts the application when the container has been created. (Docker 2021b). 

 
# syntax=docker/dockerfile:1 

FROM ubuntu:18.04 

COPY . /app 


24 

RUN make /app 

CMD python /app/app.py 

 
The example above complies with Microsoft (2021) definition of a container image. This 

example container image uses ubuntu v18.04 as base container image. An application 

together with its dependencies is copied into the container image, and the application 

is compiled inside the container image. The base image has python installed, which is 

used to run the main application file. Therefore, it can be concluded that this container 

image includes everything needed to run the application. 

 
3.2.4 Container registry 

When a container image has been built, it can be run locally. However, in order to let 

other people and servers access the same container images, a container registry is 

needed. Container registries can be used to accumulate and distribute container images 

(Poulton 2020: 51). 

 
In figure 6, the flow of creating and running container images using a container registry 

is shown. Everything starts with having an application that should be built and deployed. 

A “Dockerfile” is used to package the application source code and its libraries into a con-

tainer image (Poulton 2020: 89). In order to be able to distribute the container image, it 

is sent to a container registry where it will be stored and is accessible by others. 

 
25 

 
Figure 5. Docker image and container flow (Based on Poulton 2020: 89) 

 
Using container registries allows another person or machine to access the same con-

tainer image someone else has built. If the intent is to deploy the application to a pro-

duction server, the server can pull the container image from the container registry and 

run the application inside a container, without having to understand how the application 

should be built. The container image can be executed as it is, including everything that 

is needed to run the application. 

 
3.3 Kubernetes 

In the previous chapter containers as a technology was introduced to explain what they 

are and what they do. In this chapter, the container orchestration technology used 

throughout this thesis is introduced, called Kubernetes. The goal of this thesis is to un-

derstand how applications can be scaled on Kubernetes. Before learning how to scale a 

cluster, one must first understand what container orchestration is and what Kubernetes 

does. That is the focus of this chapter. 

 
Managing container workloads without a container orchestration tool is possible. How-

ever, this becomes harder at a larger scale when containers have to be scalable and dis-

tributed on multiple servers. This is where a container orchestration tool like Kubernetes 


26 

comes in to solve this problem. Kubernetes (2021a) describes Kubernetes as “…a 

portable, extensible, open-source platform for managing containerized workloads and 

services, that facilitates both declarative configuration and automation”. 

 
Kubernetes is an open-source project that has derived from a container orchestration 

tool at Google called “Borg” (Kubernetes 2015). Google has used Borg to manage clus-

ters and containerized applications for many years (A. Verma, L. Pedrosa, M. Korupoly, 

D. Oppenheimer, E. Tune & J. Wilkes 2015). The lessons learned from building Borg and 

other orchestrations tools at Google have been used to create Kubernetes (B. Burns, B. 

Gant, D. Oppenheimer, E. Brewer & John Wilkes, 2016). 

 
With the help of Kubernetes, it is possible to build a cluster consisting of multiple servers. 

Kubernetes takes care of managing and distributing container workloads in the cluster. 

The system administrator can tell what Kubernetes should do, by defining something 

called the “desired state” (Kubernetes 2021a). Kubernetes takes care of keeping the clus-

ter in the desired state by comparing its actual state (Kubernetes 2021a). This is how 

Kubernetes manages to achieve things like automatic deployments, rollbacks, and self-

healing (Kubernetes 2021a). 

 
In figure 6, a high-level architecture of Kubernetes is shown. This architecture overview 

shows the main components in a Kubernetes cluster. This figure shows that Kubernetes 

forms a cluster by joining multiple server nodes together. These nodes can be of any 

machine type that has a container runtime installed, which allows for Kubernetes to be 

installed both in the cloud and on bare metal servers (R. Muddinagiri, S. Ambavane & S. 

Bayas 2019: 240). This is makes Kubernetes a portable container orchestration technol-

ogy. 

 
27 

 
Figure 6. Kubernetes components (Github 2020a). License: CC BY 4.0. 

 
One of the nodes in a Kubernetes cluster is assigned to be the control plane. Its job is to 

manage the whole cluster, making sure that the actual state matches the desired state. 

The first key component in the control plane is the “kube-apiserver”, which serves as a 

gateway to Kubernetes API. The “kube-scheduler” component takes care of creating and 

distributing containers to available nodes. The “kube-controller-manager” is the compo-

nent that makes sure the actual state matches the desired state. The “etcd” component 

is the database where the desired state of the cluster is stored. Finally, the “cloud-con-

troller-manager” is an optional component that can be used to integrate the cluster with 

a cloud vendor’s API. (Kubernetes 2021m). 

 
In figure 6, the “Kubernetes nodes” are the remaining worker nodes in a cluster, where 

the actual workload is happening. The containers deployed to a Kubernetes cluster are 

running in something called a “Pod”, which are introduced in chapter 3.3.1. Each worker 

node has a container runtime installed, making it possible to run container workloads 

(Kubernetes 2021m). In addition, they have a component called “kubelet”, which Kuber-

netes (2021a) describes as “An agent that runs on each node in the cluster. It makes sure 

that containers are running in a Pod.” (Kubernetes 2021a). The last component in a 

worker node is “kube-proxy”, which is responsible for networking (Kubernetes 2021a). 

 
https://github.com/kubernetes/website/blob/main/LICENSE
https://kubernetes.io/docs/concepts/architecture/nodes/
https://kubernetes.io/docs/concepts/containers/
https://kubernetes.io/docs/concepts/workloads/pods/


28 

The term object in Kubernetes is used to describe units in the cluster (Kubernetes 2021b). 

They can be used for example to configure the cluster or running applications (Kuber-

netes 2021b). Objects can be configured either imperatively or declaratively (Kubernetes 

2021c). The rest of this chapter introduces common objects in Kubernetes and how they 

can be configured. 

 
3.3.1 Pod 

The Pod object is the lowest level object in Kubernetes. Pods can consist of several con-

tainers. However, there is usually only one container in each Pod. The possibility to have 

multiple containers in one Pod can be useful in some situations, for example if they are 

highly dependent on each other and always coexist. (Kubernetes 2021d). 

 
Pods are usually never deployed separately. Kubernetes has other objects specialized for 

both creating and managing Pods for different scenarios. Examples of these are: Deploy-

ment, Job, StatefulSet and DaemonSet. (Kubernetes 2021d). 

 
Kubernetes objects can be created using only the command line. However, usually they 

are created using YAML files (Kubernetes 2021b). The code example below by Kuber-

netes (2021e) shows how a Pod manifest YAML file can look like. The kind field specifies 

object type, metadata works as identification, and spec specifies the desired state for 

the Pod (Kubernetes 2021b). In the code example below, the Pod consists of one con-

tainer, running a NGINX web server container image. 

 
apiVersion: v1 

kind: Pod 

metadata: 

  name: static-web 

  labels: 

    role: myrole 

spec: 

  containers: 

    - name: web 

      image: nginx 

      ports: 

        - name: web 


29 

          containerPort: 80 

          protocol: TCP 

 
3.3.2 Deployment 

The Deployment object in Kubernetes is used to deploy containerized applications in 

Pods. Common purposes for the object are to create, edit, or delete Pods running a par-

ticular application. In addition, the Deployment object supports rolling back to a previ-

ous version or scaling up the number of Pod instances running an application. (Kuber-

netes 2021f). 

 
Below is an example of what a Deployment looks like by Kubernetes (2021f). The replicas 

field will create an object of type ReplicaSet that takes care of making sure that 3 Pod 

instances are running. The spec field decides what container image should be deployed 

to the Pods. The label fields under metadata and selector tells the Deployment which 

Pods to administer. (Kubernetes 2021f). 

 
apiVersion: apps/v1 

kind: Deployment 

metadata: 

  name: nginx-deployment 

  labels: 

    app: nginx 

spec: 

  replicas: 3 

  selector: 

    matchLabels: 

      app: nginx 

  template: 

    metadata: 

      labels: 

        app: nginx 

    spec: 

      containers: 

      - name: nginx 

        image: nginx:1.14.2 

        ports: 

        - containerPort: 80 

 
30 

3.3.3 Service 

The Kubernetes Service object is “An abstract way to expose an application running on 

a set of Pods as a network service.” (Kubernetes 2021g). The Kubernetes cluster gives a 

Service object a DNS-name, that can be used to discover a group of Pods. In addition, 

the Service object can load-balance traffic between multiple Pods. (Kubernetes 2021g). 

 
When Pods are deployed to a Kubernetes cluster, they get their own IP address. The 

problem with using these IP addresses is that Pods are mortal. If a Deployment’s desired 

state changes, Pods might get deleted as result since Kubernetes always has to make 

sure that the actual state matches the desired state. The Service object solves this prob-

lem by acting as a portal to a group of Pods. (Kubernetes 2021g). 

 
The code example below by Kubernetes (2021g) shows what a Service object can look 

like. The selector field is used to tell Kubernetes that this Service belongs to all Pods with 

the same label key value. The “targetPort” field specifies on which TCP port the Pods are 

listening on, and “port” is the port the Service listens on. (Kubernetes 2021g). 

 
apiVersion: v1 

kind: Service 

metadata: 

  name: my-service 

spec: 

  selector: 

    app: MyApp 

  ports: 

    - protocol: TCP 

      port: 80 

      targetPort: 9376 

 
There are different types of Services in Kubernetes. The service type can be specified by 

adding type field under the spec field in a manifest file. The default type is ClusterIP, 

which gives the Service an internal IP address that can’t be accessed from outside the 

cluster. The NodePort type can be used to open a port on each node that makes the 

Service externally accessible. The LoadBalancer type creates a load balancer for the 

cloud provider the cluster is using, which makes the service accessible to anyone. (Ku-

bernetes 2021g). 

https://kubernetes.io/docs/concepts/workloads/pods/


31 

 
3.3.4 Ingress 

Ingress is a concept in Kubernetes designed to control HTTP traffic between the cluster 

and the outside world. It can for example be used to forward incoming traffic to a specific 

Service. The configuration for how and where the incoming traffic is forwarded can be 

done by creating Ingress rules. The Ingress rules can be used to specify where a set of 

URL paths and hosts should be forwarded. (Kubernetes 2021h). 

 
In order for Ingress rules to be applied, the cluster must have at least one Ingress con-

troller installed that takes care of actually doing what has been specified in the Ingress 

rule (Kubernetes 2021h). Ingress controllers are not installed by default in the cluster 

(Kubernetes 2021h). This makes it possible to select the most suitable Ingress controller 

for a specific Kubernetes cluster. According to Kubernetes (2021h), any Ingress controller 

should work in theory. This separation between Ingress rules and Ingress controllers ab-

stracts away the underlying technology that takes care of doing the actual the work. 

 
The code example below based on Kubernetes (2021h) shows what an Ingress rule man-

ifest can look like. The rules array field allows for multiple rules to be defined. In this case, 

there is a rule that the domain “example.com” should point to a Service named “exam-

ple-service” at port 80. The path field can be used to route a specific path only to a Ser-

vice. However, in this example the path is set to “/” which means the rule is applied to 

all paths. 

 
apiVersion: networking.k8s.io/v1 

kind: Ingress 

metadata: 

  name: ingress-example 

spec: 

  rules: 

  - host: example.com 

    http: 

      paths: 

      - pathType: Prefix 

        path: "/" 

        backend: 


32 

          service: 

            name: example-service 

            port: 

              number: 80 

 
In figure 7 the purpose of having an Ingress is demonstrated. Ingress works as the bridge 

between HTTP clients and the applications running on Kubernetes (Kubernetes 2021h). 

HTTP requests first goes through the Ingress and is forwarded to a Service based on In-

gress rules. One rule could for example be to forward traffic to a specific Service based 

on the domain name “example.com” as the manifest example above. The Service finally 

forwards the traffic to one of the Pods it exposes. This example is an end-to-end demon-

stration of how applications running in Pods can be exposed to the outside world with 

the help of Service and Ingress objects in Kubernetes. 

 
Figure 7.  Ingress (Kubernetes 2021h). License: CC BY 4.0. 

 
3.3.5 Kubectl 

Kubernetes clusters can be managed using kubectl CLI. It can for example be used for 

creating or editing resources. In addition, a common use case is to get information about 

resources running in a Kubernetes cluster. (Kubernetes 2021i). 

 
In order for kubectl to be able to interact with a Kubernetes cluster, a kubeconfig file is 

needed. This file is used to switch between clusters and perform the needed authenti-

cation to be able to interact with Kubernetes API running in the cluster, using kubectl. It 

https://github.com/kubernetes/website/blob/main/LICENSE


33 

is possible to have multiple kubeconfig files and tell kubectl which one to use. (Kuber-

netes 2021j). 

 
Kubernetes objects can be managed using kubectl either imperatively or declaratively. 

The imperative approach is to perform specific kubectl commands to manage resources. 

Below is two imperative kubectl command examples by Kubernetes (2021c). Both exam-

ples deploy the same application in two different ways. The first example uses com-

mands only for achieving the deployment, called imperative commands. The second ex-

ample uses both commands and a YAML manifest file where a Deployment object has 

been described, this is called imperative object configuration. Both examples are imper-

ative since kubectl is told specifically to create something. (Kubernetes 2021c). 

 
kubectl create deployment nginx --image nginx 

 
kubectl create -f nginx.yaml 

 
The declarative approach does not tell kubectl specifically what to do, this is called de-

clarative object configuration. Instead, kubectl is only given YAML files or directories con-

taining YAML files to process. Kubernetes automatically knows what to do based on the 

contents of the manifest files. Below is an example of a declarative approach by Kuber-

netes (2021c). All manifest files inside a “configs” directory will be applied by kubectl. If 

an object described in a manifest file does not exist, kubectl will automatically create 

that object. However, if the object already exists and the manifest file has changed, ku-

bectl will automatically update that object. (Kubernetes 2021c). 

 
kubectl apply -f configs/ 

 
All approaches for managing Kubernetes objects with kubectl have their pros and cons. 

Imperative commands are simple and fast to execute. However, the changes are not de-

scribed anywhere and cannot be reused. Imperative object configuration solves these 

shortcomings by describing the actions to be taken in manifest files. The drawback of 

this approach is that it is more laborious compared to a few writing commands. 


34 

Declarative object configuration is better for applying folders containing manifest files 

and knowing what to do with them. The drawback of using the declarative approach is 

knowing why something is not working when a lot of changes have been applied. (Ku-

bernetes 2021c). 


35 

4 Scaling a Kubernetes Cluster 

In this chapter, different ways of scaling a Kubernetes cluster are studied. These scaling 

methods are studied to learn how applications can be scaled on Kubernetes. This chapter 

begins by studying how a Kubernetes cluster can be monitored in order to understand 

how metrics can be retrieved. Vertical and horizontal scaling are studied to understand 

different ways of scaling Pods. Cluster autoscaler is studied to understand how the server 

nodes can be scaled. 

 
4.1 Monitoring 

This chapter introduces how a Kubernetes cluster can be monitored. It introduces com-

ponents and technologies that can be installed in a cluster for scaling purposes. These 

technologies are used in the scalability testing chapter. 

 
4.1.1 Metrics Server 

Kubernetes Metrics Server is a service that can be installed in a Kubernetes cluster to get 

information about cluster resources. The Metrics Server is able to retrieve this infor-

mation from the kubelet component which is found on all nodes in a cluster. This data 

can be accessed via Kubernetes API server which is extended by an additional Kuber-

netes Metrics API. The Metrics Server is designed for autoscaling purposes only. It has to 

be installed in order to use Horizontal Pod Autoscaler and Vertical Pod Autoscaler scaling 

methods. (Oracle 2021). 

 
4.1.2 Prometheus 

Brazil (2018) describes Prometheus as “…an open source, metrics-based monitoring sys-

tem”. It was originally created by Sound Cloud, and today it is part of CNCF. Prometheus 


36 

is built for unifying metrics from multiple data sources. It can for example retrieve met-

rics from applications, servers, and other monitoring systems. In the case of Kubernetes, 

it can automatically find nodes and applications to retrieve metrics from. The metrics 

can be used as data source for visualization tools like Grafana. (Brazil 2018: 3–4). 

 
Prometheus is able to discover Kubernetes objects and nodes to retrieve metrics from 

through the Kubernetes API server. All nodes in a Kubernetes cluster have a kubelet com-

ponent which is used to retrieve metrics about nodes. For applications running in Kuber-

netes, Prometheus is able to scrape all exposed container ports inside a pod. (Brazil 2018: 

159–166). 

 
Prometheus is used in this thesis to retrieve metrics from applications running in a Ku-

bernetes cluster. These metrics are used as data source in Grafana dashboards for visu-

alizing load testing results. 

 
4.1.3 Grafana 

Grafana is a tool designed for data analytics at scale. It is open-source, flexible and easy 

to integrate with various data sources and other monitoring tools. When installed, it of-

fers a dashboard that can be used to visualize and analyze data. (Shivang 2019). 

 
When Kubernetes clusters are monitored using tools like Prometheus, a lot of data is 

collected about the cluster, nodes, and application workloads. In order to analyze this 

data, Grafana is used in this thesis as a data visualization tool. The Grafana dashboard is 

used to analyze Prometheus metrics collected when load testing is performed. 

 
37 

4.1.4 Helm 

Helm is a tool designed for packaging applications together with their configurations for 

Kubernetes. This is done by creating Helm Charts that packages all needed YAML config-

uration files for a specific application. These charts can be versioned and distributed via 

repositories. This makes it easier to install applications on Kubernetes clusters. (D. Mer-

ron & T. Idowu 2020). 

 
Helm is used in this thesis to help installing Prometheus and Grafana. These tools are 

often used together in Kubernetes for monitoring a cluster. There are several helm pack-

ages available for installing Prometheus and Grafana in a Kubernetes cluster. 

 
4.2 Horizontal pod scaling 

Scaling horizontally means to increase the number of compute instances. To begin with, 

one can start off by running only one instance, and add more instances later when there 

is more demand. The advantage of doing horizontal scaling is that if an instance is expe-

riencing issues, other instances are not affected and can continue to function. (Techope-

dia 2021a). 

 
When Pods are scaled horizontally, the HTTP traffic can be load balanced between the 

Pods. This is possible by having a Kubernetes Service in front of the Pods (Kubernetes 

2021g). The benefit of sharing the load between multiple instances is that it can decrease 

HTTP response time (S. Jain & A. K. Saxena 2016). In addition, it can enhance throughput 

(D. Sharma 2018). 

 
In the Kubernetes world, horizontal scaling is done through increasing the amount of 

Pod instances (Kubernetes 2021k). The first way to scale horizontally is to manually in-

crease the number of Pods by changing the replicas field in a Deployment object for 


38 

example (Kubernetes 2021f). When Pods are scaled manually, the number of Pods is 

fixed. 

 
Scaling Pods horizontally can be automated by using Horizontal Pod Autoscaler (HPA). 

This is done by specifying a condition for when the Pods should be scaled. The condition 

for when Pods should be scaled can for example be decided by how much CPU or RAM 

the current Pods are using. In addition, it is possible to define even more customized 

conditions for when to scale Pods. (Kubernetes 2021k). 

 
In figure 8, the HPA concept is shown. It can be implemented by creating a Horizon-

talPodAutoscaler object. This HPA object is linked with a Deployment object through 

which it is able to scale the number of Pods. The HPA scales the number of Pods by ed-

iting the replicas field in the Deployment object. Since the Deployment object in Kuber-

netes has a controller that makes sure the actual number of Pods is equal to the desired 

state, the changes will automatically take effect. (Kubernetes 2021k). 

 
Figure 8.  Horizontal Pod Autoscaler (Kubernetes 2021k). License: CC BY 4.0. 

 
Kubernetes is able to automate horizontal scaling through a controller. This controller 

has a time interval for checking if Pods should be scaled. Each time the controller runs, 

https://github.com/kubernetes/website/blob/main/LICENSE


39 

it compares the desired metrics in the HPA object with the actual metrics at that moment. 

If the condition defined in the HPA object is based on CPU or RAM usage, it retrieves the 

Pod metrics from Resource Metrics API. However, if the condition for scaling is a custom 

one, the metrics are retrieved from Custom Metrics API. (Kubernetes 2021k). 

 
If there is not enough Pods running to meet the desired state defined in an HPA object, 

Kubernetes will increase the number of Pods. However, if there are more Pods then 

needed running, Kubernetes will decrease the number of Pods to a degree where the 

desired state is still fulfilled. This is achieved by continuously calculating an optimal 

amount of Pod replicas to meet the desired state. (Kubernetes 2021k). 

  
4.3 Vertical pod scaling 

Vertical scaling is to increase resources on a compute instance (Techopedia 2021b). This 

can for example mean to increase a server’s RAM or CPU (Techopedia 2021b). According 

to Section (2020), the advantage of scaling vertically is that it is simpler than horizontal 

scaling in terms of not having to think about how to connect multiple compute instances. 

However, the disadvantage of scaling a compute instance vertically is that it often re-

quires downtime (Section 2020). 

 
In Kubernetes, Pods can be scaled vertically by specifying how much resources the con-

tainers in a Pod can use. Usually this is done by configuring CPU and RAM for the con-

tainers. How much resources is needed to run a container can be specified by setting a 

resource request. Kubernetes scheduler selects a node to deploy the Pod to using this 

information. The maximum amount of resources a container can use can be specified by 

setting resource limit. (Kubernetes 2021l). 

 
Below is an example by Kubernetes (2021l) on how resources for containers in a Pod can 

be managed. This Pod runs two different containers with their own resource requests 

and limits. In this case, both containers have the same amount of resources. They require 


40 

a minimum of 64 MiB of RAM and 250m CPU, which is specified in the requests field. 

The containers can’t use more than 128MiB of RAM and 500m CPU, specified in the limits 

field. The “m” unit for CPU stands for millicpu, where 1000m is equivalent to 1 CPU core 

(Kubernetes 2021l). Pods can be scaled vertically by changing container resources as in 

this example. 

 
apiVersion: v1 

kind: Pod 

metadata: 

  name: frontend 

spec: 

  containers: 

  - name: app 

    image: images.my-company.example/app:v4 

    resources: 

      requests: 

        memory: "64Mi" 

        cpu: "250m" 

      limits: 

        memory: "128Mi" 

        cpu: "500m" 

  - name: log-aggregator 

    image: images.my-company.example/log-aggregator:v6 

    resources: 

      requests: 

        memory: "64Mi" 

        cpu: "250m" 

      limits: 

        memory: "128Mi" 

        cpu: "500m" 

 
The second method for scaling Pods vertically is to use Vertical Pod Autoscaler (VPA). It 

can be used to automate the management of Pod resources. VPA can automatically scale 

the Pods vertically by changing RAM and CPU requests or limits for the containers in a 

Pod. It can automatically find optimal resources for Pods when the load changes. If a 

Pods has too little resources, VPA can automatically add more resources. If the Pod has 

too much resources, VPA automatically reduces Pod resources. (Github 2021a). 

 
Below is an example by (Github 2021a) on how to create a VPA manifest. In this example 

the VPA object is created to control the containers created by a Deployment object. In 

this example the VPA runs in an “Auto” update mode. This mode allows the VPA object 

to automatically change pod resources at any time of a Pod’s lifecycle. If the mode is set 


41 

to “Initial”, VPA is only allowed to change resources when a new Pod is initialized. Finally, 

if the mode is set to “Off”, VPA can’t change Pod resources. Instead, the “Off” mode can 

only provide the information about what the optimal resources would be for the Pod. 

(Github 2021a). 

 
apiVersion: autoscaling.k8s.io/v1 

kind: VerticalPodAutoscaler 

metadata: 

  name: my-app-vpa 

spec: 

  targetRef: 

    apiVersion: "apps/v1" 

    kind:       Deployment 

    name:       my-app 

  updatePolicy: 

    updateMode: "Auto" 

 
4.4 Cluster Autoscaler 

Kubernetes has a Cluster Autoscaler tool designed to automatically scale a cluster when 

resource requirements change. The Cluster Autoscaler automatically adds new server 

nodes if Kubernetes fails to find a node with enough resources to run new Pods. On the 

contrary, Cluster Autoscaler removes unnecessary nodes when there are more node re-

sources than needed to run the current workload. (Github 2021b). 

 
In figure 9, the first gray box shows a scenario where Cluster Autoscaler scales up. The 

cluster initially consists of three nodes, with four Pods running in the first node. The re-

maining two nodes have three Pods each running. On the left-hand side in this scenario, 

there are three scheduled Pods ready to be deployed to the cluster. However, since there 

is not enough resources to run these Pods on the current nodes, ClusterAutoscaler de-

ployed one of the Pods to an existing node, added a new node to the cluster, and finally 

deployed the two remaining Pods to the new node. When new Pods have been sched-

uled for Kubernetes to run and there are not enough resources on any node to run them, 

Cluster Autoscaler automatically adds new nodes in order to run the new Pods (Google 

Cloud 2020). 


42 

 
Figure 9.  Cluster Autoscaler (Google Cloud 2020). License: CC BY 4.0. 

 
In figure 9, the scaling down case for Cluster Autoscaler is demonstrated in the second 

gray box. In this case, the initial cluster consists of four nodes. The first node has four 

Pods running, the second node has three Pods, third node has one, and the fourth has 

two Pods running. The Cluster Autoscaler decreases the number of nodes if the nodes 

have enough unused resources (Google Cloud 2020). On the left-hand side, node 3 has 

a lot of unused resources since only one Pod is running. The Pod running in node 3 is 

able to fit inside node 4, which means that the Cluster Autoscaler can combine Pods in 

node 3 and 4 to the same node. In this example, the Pod on node 3 is moved to node 4 

and the Cluster Autoscaler removed node 3 as shown on right-hand side. As a result, the 

cluster is able to run the same Pods using less resources automatically. 

 
The Cluster Autoscaler can be used together with other autoscaling tools, for example 

HPA. When using HPA, the number of Pods scales automatically up or down based on 

              
https://creativecommons.org/licenses/by/4.0/


43 

load. The Cluster Autoscaler can automatically add or remove nodes in the cluster based 

on the changing number of Pods controlled by HPA. (Github 2021b).  

 
44 

5 Test environment 

In order to be able to run load tests, a test environment has to be built. This environment 

is a local Kubernetes cluster, meaning that the cluster is running on the same local PC 

used for load testing. This cluster is not a production ready cluster supposed to replicate 

a proper Kubernetes cluster running in the cloud. This local cluster is only used to 

demonstrate application scaling on Kubernetes. 

 
5.1 Cluster architecture 

The test environment used for scalability testing is a local Kubernetes cluster consisting 

of virtual machines. This cluster is built using Venkat Nagappan’s Github project1 that 

creates a local Kubernetes consisting of three nodes. The virtual machines are created 

using Vagrant, which is a tool for creating virtual machines. 

 
In figure 10, the cluster architecture for the test environment is demonstrated. Starting 

from the bottom, the underlying infrastructure for the cluster is three virtual machines 

with Linux Ubuntu 20.04 OS installed on them. The first virtual machine is assigned to be 

the control plane, having 2 CPU cores and 2560MB of memory. The other two are com-

puting machines, consisting of 1 CPU core and 2056MB memory each. On top of the 

infrastructure Kubernetes has been installed to join the nodes into one complete Kuber-

netes cluster. The highest level of the architecture diagram shows the Kubernetes re-

sources used to run the example application on this cluster. The example application is 

deployed using the Deployment object that takes care of running the application inside 

Pods. The Service object load balances the traffic between the Pods and takes care of 

exposing the Pods. In order for outside traffic to able to interact with the application, an 

Ingress controller is installed and an Ingress rule is created to point the local domain 

 
1 Venkat Nagappan’s Github project for building a local Kubernetes cluster: https://github.com/justmean-
dopensource/kubernetes/tree/master/vagrant-provisioning  


45 

“backend-go.fliq.test” to the Service that exposes the example application Pods. This lo-

cal domain name can be used to interact with the example application, allowing the load 

test HTTP requests to reach the application under test. 

  
Figure 10.  Cluster architecture diagram 

 
5.2 Example application 

In this chapter, the Kubernetes resources needed to run the example application are cre-

ated. These are shown in figure 10, the highest level of the cluster architecture diagram. 

The needed resources are Deployment, Service, and Ingress. 

 
46 

The initial Deployment manifest is shown below. This Deployment object deploys one of 

Fliq’s example REST API applications built using Go programming language. This Deploy-

ment deploys one Pod replica that runs the REST API in a container inside the Pod. The 

image field specifies that the container image should be pulled from the container reg-

istry where Fliq’s example application is located. Since the REST API server is listening on 

port 8080, the same port on the container itself must be exposed. The container re-

quests at least 100m of CPU and 128Mi of RAM. If there are enough resources in the 

cluster, the container is limited to use a maximum of 200m CPU and 256Mi RAM if 

needed. 

 
apiVersion: apps/v1 

kind: Deployment 

metadata: 

  name: fliq-backend-go 

  labels: 

    app: fliq-backend-go 

    component: backend-go 

spec: 

  replicas: 1 

  selector: 

    matchLabels: 

      app: fliq-backend-go 

      component: backend-go 

  template: 

    metadata: 

      name: fliq-backend-go 

      labels: 

        app: fliq-backend-go 

        component: backend-go 

    spec: 

      containers: 

        - name: backend-go 

          image: fliqreg.azurecr.io/backend-go/local 

          imagePullPolicy: Always 

          ports: 

            - containerPort: 8080 

          resources: 

            requests: 

              cpu: 100m 

              memory: 128Mi 

            limits: 

              cpu: 200m 

              memory: 256Mi 
 

In order to expose the Pods created by the Deployment object, a Service is created as 

shown below. This Service is of type ClusterIP, meaning that its IP address is only reach-

able within the cluster. The Service itself is reachable on port 80. However, the target 


47 

port is set to the same number as the example application container port specified in 

the Deployment manifest, port 8080. In order for this Service to find the Pods running 

the example application, the same selector fields are used as in the Deployment object. 

 
apiVersion: v1 

kind: Service 

metadata: 

  name: fliq-backend-go 

  labels: 

    app: fliq-backend-go 

    component: backend-go 

spec: 

  type: ClusterIP 

  ports: 

  - port: 80 

    protocol: TCP 

    targetPort: 8080 

  selector: 

    app: fliq-backend-go 

    component: backend-go 

 
The final Kubernetes object that has to be created for the example application is the 

Ingress rule as shown below. This Ingress is used to allow traffic coming from outside the 

cluster to find the Service that exposes the example application. The correct service is 

detected by setting the name of the Service and its port number. This particular Ingress 

rule is applied to all HTTP requests for the local domain “backend-go.fliq.test”. Optional 

annotations have been set for the nginx Ingress controller installed in this cluster to in-

crease default timeout. 

 
apiVersion: networking.k8s.io/v1 

kind: Ingress 

metadata: 

  name: backend-go-ingress 

  annotations: 

    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600" 

    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600" 

spec: 

  rules: 

    - host: backend-go.fliq.test 

      http: 

        paths: 

          - path: / 

            pathType: Prefix 

            backend: 

              service: 

                name: fliq-backend-go 

                port: 

                  number: 80 


48 

6 Scalability testing 

In this chapter, scalability testing is performed against one of Fliq’s example REST API 

applications. First, the objective and scope are defined for scalability testing. The scala-

bility testing is done by load testing the example application running in the test environ-

ment. 

 
6.1 Objective 

The goal of this scalability testing is to get a better understanding of how an application 

can be scaled on Kubernetes. Scalability is tested by performing load tests against an 

example application running in a Kubernetes cluster. Scaling methods are later applied 

to see if the application can be scaled when running in a Kubernetes cluster. New load 

tests are performed after scaling to see if the application is able to scale. 

 
The first load tests are performed against the example application running in a single 

Pod. The application is later scaled vertically and horizontally to see if the scaling meth-

ods works. This scalability testing is limited in terms of only testing the scalability of a 

single application running in the cluster. The cluster nodes are not scaled since Cluster 

Autoscaler only works for specific cloud providers as of now (Github 2022). The test en-

vironment used in this thesis is a local Kubernetes cluster built using virtual machines. 

 
6.2 Load testing 

In this chapter load tests are performed against the example application deployed to the 

local Kubernetes cluster created as a test environment in the previous chapter. The tool 

selected to create and execute load tests is JMeter. For monitoring resource usage during 

load tests, Prometheus and Grafana is used. The first tests are executed against the initial 


49 

Deployment defined in chapter 5.2. New tests are run after scaling the application verti-

cally and horizontally. 

 
Since the example application is a REST API, load tests have to be executed against spe-

cific API resources, also known as API endpoints. In figure 11, the endpoint chosen for 

the load tests is shown. The HTTP request points to the local domain serving the example 

application deployed to the local Kubernetes cluster. The HTTP method is of type GET, 

and the selected endpoint is shown in the path input field. 

 
Figure 11.  API endpoint used for load testing 

 
For testing scalability, the load tests start by simulating 1000 users for the initial resource 

limits set for the Deployment in chapter 5.2.2. For each test the number of users is in-

creased by 1000 until the API runs out of resources. After this initial test, the application 

is scaled both vertically and horizontally to see if it can handle more users. Between the 

initial, vertical and horizontal scaling tests, metrics given by JMeter are compared. The 

average response time is analyzed to compare the average time it takes for the server to 

send HTTP response. In addition, throughput is analyzed to compare the number of re-

quests the server processes per second. 

 
6.2.1 JMeter setup 

In order to simulate a certain number of users sending requests to the API, a concurrency 

thread group is created in JMeter. Figure 12 shows the concurrency thread group used 

to simulate a certain number of users. In this particular example, target concurrency is 

set to 6000, meaning that 6000 threads or users are simulated. Ramp up time is 20 and 


50 

steps count 10, meaning that 600 new threads are created every 2 seconds up to 20 

seconds. Hold target rate time is set to 10, meaning that when the number of threads 

reaches 6000, the load is held for 10 seconds. In the beginning, the target concurrency 

is set to 1000, and later increased by 1000 for each new test. 

 
Figure 12.  Concurrency thread group example 

 
6.2.2 Initial test 

This initial test is executed against the example application described in the Deployment 

manifest shown in chapter 5.2.2. The application requests 100m CPU and 128Mi of RAM. 

In addition, it has a limit of 200m CPU and 256Mi of RAM. Figure 13 shows all resources 

found that uses the same app selector with the help of kubectl CLI. The figure shows that 

the initial Deployment manifest created a deployment object. Behind the scenes, this 

Deployment object also created a ReplicaSet object that manages the number of Pod 

instances. In the Deployment manifest the replicas field was set to 1. The figure shows 

that 1 Pod instance is running as expected. The Service object created in chapter 5.2.2 is 

also shown in figure 13 since the kubectl command used to show the resource outputs 

searches for all Kubernetes objects with the same app selector “fliq-backend-go”. 

 
51 

 
Figure 13.  Kubernetes resources used for initial test 

 
Table 1 shows JMeter results for 1000-6000 users. For each new test, the users are in-

creased by 1000 users. The average response time for the HTTP requests increases when 

the number of users increase. Throughput stays around 30 requests/s for all tests. How-

ever, for the final test of 6000 users 79,97% of the HTTP requests failed. These results 

show that the REST API endpoint fails at 6000 users for the initial resources set for the 

application.  

 
Table 1. JMeter results for initial test 

users avg response time (ms) error (%) throughput (requests/s) 

1000 18584 0 33,9 

2000 38808 0 32,4 

3000 63881 0 29,7 

4000 81407 0 32,4 

5000 106165 0 31,1 

6000 101575 79,97 33,7 

 
Figure 14 shows Pod details after the test of 6000 users with the help of “kubectl de-

scribe” command. This command shows the last state of the Pod was terminated. In ad-

dition, it shows the reason for the termination was “OOMKilled”, meaning that the Pod 

ran out of memory. This is the reason for why the majority of the HTTP requests failed 

for 6000 users. However, the Pod’s current state is running, and the restart count is 1, 

meaning that the Pod has automatically restarted after running out of memory. 

 
52 

 
Figure 14.  Pod details after running out of memory 

 
Figure 15 shows the max CPU usage for the initial test. The first test of 1000 users stays 

below the requested amount of 0,1 CPU cores. All tests between 2000-6000 users reach 

close to the CPU limit of 0,2 CPU cores. 

 
Figure 15.  Max CPU usage for the initial load test 

 
0

0,05

0,1

0,15

0,2

0,25

1000 2000 3000 4000 5000 6000

cp
u

 (
co

re
s)

users

max cpu requests limits


53 

Figure 16 shows the max RAM usage for the initial test. Both 1000 and 2000 users stays 

below the requested amount of 128 MiB. When simulating 3000-5000, RAM usage stays 

between the requested and limited amount. The final test of 6000 users stays below the 

limited amount according to what is monitored in Grafana. However, figure 14 shows 

that the Pod ran out of memory after the 6000-user test. 

 
Figure 16.  Max RAM usage initial test 

 
6.2.3 Vertical scaling 

For the next load test, the example application is scaled vertically. This means increasing 

the resources for the single instance running the example application. The YAML file be-

low shows the Deployment manifest used for scaling vertically. This manifest is the same 

as for the initial test, except the resource requests and limits have changed. The re-

quested CPU has increased from 100m to 200m, and the requested memory has in-

creased from 128Mi to 256Mi. In addition, CPU limit has increased from 200m to 400m, 

and memory limit has increased from 256Mi to 512Mi. 

 
0

50

100

150

200

250

300

1000 2000 3000 4000 5000 6000

ra
m

 (
M

iB
)

users

max ram requests limits


54 

apiVersion: apps/v1 

kind: Deployment 

metadata: 

  name: fliq-backend-go 

  labels: 

    app: fliq-backend-go 

    component: backend-go 

spec: 

  replicas: 1 

  selector: 

    matchLabels: 

      app: fliq-backend-go 

      component: backend-go 

  template: 

    metadata: 

      name: fliq-backend-go 

      labels: 

        app: fliq-backend-go 

        component: backend-go 

    spec: 

      containers: 

        - name: backend-go 

          image: fliqreg.azurecr.io/backend-go/local 

          imagePullPolicy: Always 

          ports: 

            - containerPort: 8080 

          resources: 

            requests: 

              cpu: 200m 

              memory: 256Mi 

            limits: 

              cpu: 400m 

              memory: 512Mi 
 

Table 2 shows the JMeter results for the vertical scaling load tests. Similar to the initial 

test, the average response time inceases when the number of users increase. The 

througput is sligthly higher for the 1000-3000 user tests, and after it stays around 35 

requests/s. For the vertical scaling tests there are test results for 1000-10000 users, since 

the REST API is able to scale beyond 6000 users compared to the initial test. The error 

percentage column shows that the REST API is able to handle 10000 users without any 

errors when the Pod is scaled vertically.  

 
55 

Table 2. JMeter results for vertical scaling test 

users avg response time (ms) error (%) throughput (requests/s) 

1000 14913 0 44,3 

2000 30436 0 41,7 

3000 51885 0 37 

4000 74671 0 35,6 

5000 91835 0 36,6 

6000 113648 0 34,9 

7000 128833 0 37,1 

8000 163461 0 34,4 

9000 178587 0 35,1 

10000 191182 0 36,4 

 
Figure 17 shows max CPU usage between 1000-10000 users when the Pod has been 

scaled vertically. For the first test of 1000 users, CPU usage is below the requested 

amount 0,2 cores. For all tests between 2000-10000 users, CPU usage is between the 

requested amount (0,2 cores) and limited amount (0,4 cores). According to the metrics 

given by Grafan, the Pod’s CPU did not reach its limit during the tests. 

 
56 

 
Figure 17.  Max CPU usage vertical scaling test 

 
Figure 18 shows max RAM usage between 1000-10000 users for the vertical scaling test. 

RAM usage stays below the requested amount 256MiB for 1000-5000 users. For 6000-

10000 users, RAM usage stays between the requested (256MiB) and limited (512MiB). 

 
Figure 18.  Max RAM usage vertical scaling test 

 
0

0,05

0,1

0,15

0,2

0,25

0,3

0,35

0,4

0,45

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

cp
u

 (
co

re
s)

users

max cpu requests limits

0

100

200

300

400

500

600

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

ra
m

 (
M

iB
)

users

max ram requests limits


57 

6.2.4 Horizontal scaling 

For the third and final load test, the example application is scaled horizontally instead of 

vertically. This means increasing the number of Pod instances running the example ap-

plication. Below the Deployment manifest is shown for horizontal scaling. The difference 

between this manifest and the one used for the initial test is that the number of replicas 

has been increased from 1 to 2. This change increases the number of Pod instances to a 

total of two when the manifest file is applied. Each Pod has the same resources as the 

Pod used for the initial test. However, this time there are two Pods that can share the 

load since the Service load balances traffic between the Pods it exposes. 

 
apiVersion: apps/v1 

kind: Deployment 

metadata: 

  name: fliq-backend-go 

  labels: 

    app: fliq-backend-go 

    component: backend-go 

spec: 

  replicas: 2 

  selector: 

    matchLabels: 

      app: fliq-backend-go 

      component: backend-go 

  template: 

    metadata: 

      name: fliq-backend-go 

      labels: 

        app: fliq-backend-go 

        component: backend-go 

    spec: 

      containers: 

        - name: backend-go 

          image: fliqreg.azurecr.io/backend-go/local 

          imagePullPolicy: Always 

          ports: 

            - containerPort: 8080 

          resources: 

            requests: 

              cpu: 100m 

              memory: 128Mi 

            limits: 

              cpu: 200m 

              memory: 256Mi 

 
Figure 19 shows that there are now two pod instances running the example application. 

The age column of the command output shows that there was only one Pod to begin 


58 

with since it has been running for 5 minutes and 56 seconds. The second Pod has only 

been running for 12 seconds. Kubernetes created 1 additional Pod as a result after the 

manifest was reapplied, after increasing the replicas field from 1 to 2. The desired state 

changed, and Kubernetes reacts by comparing the actual state with the desired state, 

and as a results notices that one additional Pod has to be created. 

 
Figure 19.  Pod instances for horizontal scaling 

 
Table 3 shows JMeter results for the horizontal scaling test between 1000-10000 users. 

The average response time increases as the number of users increases. This time the 

REST API is able to scale beyond 6000 users as well, since no HTTP requests failed during 

any test. Throughput is around 50 requests/s for all tests, except for the last test of 10000 

users decreased it down to 43 requests/s. 

 
Table 3. JMeter results for horizontal scaling test 

users avg response time (ms) error (%) throughput (requests/s) 

1000 11426 0 53,9 

2000 22656 0 54,6 

3000 32495 0 55,3 

4000 42246 0 55,2 

5000 49641 0 48,6 

6000 71618 0 49,5 

7000 90920 0 47,4 

8000 102083 0 49,5 

9000 120924 0 46,8 

10000 151675 0 43 

 
59 

Figure 20 shows the max CPU usage for the horizontal scaling test. The cluster has two 

Pods running the REST API application, and the Service in front of the Pods load balances 

the traffic. This phenomenon is shown both figures 21 and 22, both Pods are sharing the 

load. The results from Grafana that the CPU usage is for the most part evenly distributed 

between the Pods. For the 1000-user test, Pod 1 used less CPU than the requested 

amount (0,1 cores), and Pod 2 max CPU reaches the requested amount. For 2000-10000 

tests, the max CPU usage is between the requested (0,1 cores) and limited (0,2 cores).  

 
Figure 20.  Max CPU usage horizontal scaling test 

 
0

0,05

0,1

0,15

0,2

0,25

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

cp
u

 (
co

re
s)

users

max cpu (pod 1) requests limits

0

0,05

0,1

0,15

0,2

0,25

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

cp
u

 (
co

re
s)

users

max cpu (pod 2) requests limits


60 

Figure 21 shows max RAM usage for the horizontal scaling test between 1000-10000 

users. Similar to CPU usage, RAM usage is evenly distributed between the Pods. For 

1000-5000 user tests, max RAM usage is lower than the requested amount 128MiB. For 

6000 users and more, max RAM usage is between the requested (128MiB) and limited 

(256MiB). 

 
Figure 21.  Max RAM usage horizontal scaling test 

 
0

50

100

150

200

250

300

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

ra
m

 (
M

iB
)

users

max ram (pod 1) requests limits

0

50

100

150

200

250

300

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

ra
m

 (
M

iB
)

users

max ram (pod 2) requests limits


61 

 
6.3 Evaluation 

The initial load testing results show that the REST API endpoint under test failed to scale 

beyond 5000 users. This is further shown in figure 22, where the error percentage re-

trieved from JMeter results is compared between the initial, vertical, and horizontal scal-

ing tests. The Pod ran out of memory during the 6000-user test for the initial Deployment 

configuration. Both the vertical and horizontal scaling test results show that the REST API 

was able to scale beyond 6000 users, even up to 10000 users without HTTP requests 

failing. This shows that both vertical and horizontal scaling can be used to provide 

enough resources for an application running on Kubernetes. 

  
Figure 22.  HTTP error % for initial, vertical, and horizontal scaling tests 

 
The initial test caused the Pod to run out of memory during the 6000-user test. This 

caused the Pod to terminate, and Kubernetes automatically restarted the Pod. The rea-

son for Kubernetes doing this, is that it noticed that the actual state is different compared 

to the desired state, which is that one REST API Pod replica should always be running. 

This aligns with Kubernetes (2021a) describing Kubernetes as self-healing. 

 
0 0 0 0 0

79,97

0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0

0

10

20

30

40

50

60

70

80

90

1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6 0 0 0 7 0 0 0 8 0 0 0 9 0 0 0 1 0 0 0 0

ER
R

O
R

 %

USERS

initial test vertical scaling test horizontal scaling test


62 

In figure 23, average response time is compared between the initial, vertical, and hori-

zontal scaling tests. Initial and vertical scaling tests show similar linear growth for the 

average response time as the number of users is increased by 1000. Horizontal scaling 

test results show that the average response time decreased when the number of Pods 

increased from 1 to 2. This aligns with S. Jain & A. K. Saxena (2016) statement about 

response time decreasing when HTTP traffic is load balanced. The average response is 

not only lower when the REST API is scaled horizontally, it also grows at a lower rate for 

1000-5000 user tests.  

 
Figure 23.  Average response time for initial, vertical, and horizontal scaling tests 

 
In figure 24, throughput is compared between initial, vertical, and horizontal scaling tests. 

For the initial tests, throughput was consistently around 30 requests/s for 1000-6000 

users. When scaling vertically, throughput started higher from 44,3 requests/s, and 

slowly decreased to towards about 35 requests/s and stayed there between 4000-10000 

users. Horizontal scaling test resulted in higher throughput compared to the initial and 

vertical scaling test, for all 1000-10000 user tests. This aligns with D. Sharma (2018) say-

ing that horizontal scaling can result in higher throughput. 

 
0

50000

100000

150000

200000

250000

1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6 0 0 0 7 0 0 0 8 0 0 0 9 0 0 0 1 0 0 0 0

A
V

G
 R

ES
P

O
N

SE
 T

IM
E 

(M
S)

USERS

initial test vertical scaling test horizontal scaling test


63 

 
Figure 24.  Throughput for initial, vertical, and horizontal scaling tests 

 
In general, the load testing results show that applications running on Kubernetes can be 

scaled both vertically and horizontally. Both scaling methods solved the problem where 

the REST API ran out of memory when the number of users reached 6000. The resource 

usage metrics from Grafana also showed that the REST API needed more memory in 

order scale to 6000 users and beyond. In addition, Grafana metrics show that Kubernetes 

is able to scale horizontally by evenly distributing traffic between Pod instances. 

0

10

20

30

40

50

60

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

TH
R

O
U

G
H

P
U

T 
(R

EQ
U

ES
TS

/S
)

USERS

initial test vertical scaling test horizontal scaling test


64 

7 Conclusion 

This chapter is the thesis conclusion. First, the findings about how a Kubernetes cluster 

can be scaled is discussed, based on the researched scaling methods and load testing 

results. In addition, the limitations of this research are discussed. Finally, proposals are 

made for future research. 

 
7.1 Scaling a Kubernetes cluster 

The main goal of this thesis was to research how a Kubernetes cluster can be scaled with 

containerized applications running on it. This was done by first researching how applica-

tions can be scaled when deployed to Kubernetes. Next step was to research how the 

cluster itself can be scaled through Cluster Autoscaler. To get even further insight in Ku-

bernetes scalability, a REST API deployed to a local Kubernetes cluster was load tested 

using JMeter. 

 
Containerized applications running on Kubernetes can be scaled both vertically and hor-

izontally. Vertical scaling can be achieved by increasing CPU or RAM for each container 

inside a Pod, either manually by changing requested and limited resource specifications, 

or automatically using Vertical Pod Autoscaler (Kubernetes 2021l). Horizontal scaling can 

be achieved by running more than one instance of a Pod, either manually by changing 

the Deployment replicas property, or automatically using Horizontal Pod Autoscaler (Ku-

bernetes 2021k). 

 
The Kubernetes cluster itself can automatically be scaled using Cluster Autoscaler. Ku-

bernetes is able to automatically add or remove servers in the cluster depending on how 

much resources the Pods are using (Github 2021b). Cluster Autoscaler can be used to-

gether with HPA, where the change in number of Pods automatically change the number 

of nodes needed to provide enough resources for the current workload (Github 2021b). 


65 

This combination makes it possible to automatically scale both the application and the 

underlying infrastructure on demand. 

 
In order to get a better understanding of scaling applications running on Kubernetes, a 

REST API was deployed to a local Kubernetes cluster. This cluster consisted of three Va-

grant virtual machines, where one node is set as control plane and the other two as 

worker nodes. JMeter was used to load test one REST API endpoint. Load tests started 

from 1000 users, and after each test the number of users was increased by 1000 to a 

maximum 10000 users if the application could handle it. The same tests were run after 

scaling the application both vertically and horizontally in order to see how they affect 

results. 

 
For the initial Deployment configurations, the REST API endpoint was only able to handle 

5000 users. When simulating 6000 users, 79.97% of the HTTP requests failed. The reason 

for this was that the REST API ran out of memory. Kubernetes automatically restarted 

the Pod after termination since Kubernetes always compares the desired state with the 

actual state (Kubernetes 2021a). This aligns with Kubernetes (2021a) describing Kuber-

netes as a self-healing platform. The Grafana metrics showed that max CPU usage 

reached its limit of 0,2 cores during the 2000-6000 user tests. In addition, the Grafana 

results showed that memory usage reached closer the limit of 256MiB for each test. 

 
The REST API Pod was scaled vertically, by increasing CPU and RAM. As a result, the ap-

plication was able to scale up to 10000 users. However, scaling vertically only solved the 

REST API running out of memory. The average response time was similar to the initial 

test results. Scaling vertically increased the throughput during 1000-5000 user tests. 

However, between 6000-10000 users the throughput was the same as for the initial test. 

 
Horizontal scaling was applied on the application by adding one more Pod instance. 

Hence, the Service exposing the Pods was able to load balance the traffic between the 

Pods. The Grafana metrics show that the load was distributed. JMeter results showed 


66 

that horizontal scaling decreased the average response time for all 1000-10000 user tests. 

In addition, it decreased the average response time growth rate. Finally, horizontal scal-

ing results showed an increase in throughput for all tests. 

 
Scaling horizontally is preferred for Fliq’s example REST API. Vertical scaling was not able 

to bring more significant benefits than increasing the number of supported users. On 

the contrary, horizontal scaling was able to increase the number of supported users, de-

crease average response time, and increase throughput. In addition, horizontal scaling is 

able to serve clients even if one instance terminates. 

 
7.2 Limitations and future research 

The purpose of this thesis was to get a better understanding of how a Kubernetes cluster 

can be scaled. Kubernetes is complex container orchestration platform, that abstracts 

away the underlying infrastructure and its own internal components. Kubernetes can be 

deployed anywhere from bare metal to different cloud providers. Each cluster can have 

a unique setup, and different types of workloads running on it. The local Kubernetes 

cluster used for load testing in this thesis was not a production grade cluster. For future 

research, a cloud provider’s production grade Kubernetes service could be used for com-

parison, for example Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service 

(EKS) or Google Kubernetes Engine (GKE). A cloud provider’s Kubernetes service would 

also allow one to test the Cluster Autoscaler feature of Kubernetes. 

 
The JMeter load testing client and the local Kubernetes cluster used in this thesis were 

running on the same laptop. It would have been preferable to have them running in 

completely different environments. The consumers or clients of a REST API usually have 

separate devices and are located somewhere else physically. For future research, latency 

caused by users’ location could be considered when load testing a Kubernetes cluster. 

 
67 

The purpose of the load tests in this thesis were to get a better understanding of how 

applications running on Kubernetes can be scaled. For this reason, the same REST API 

endpoint was used to test and compare the researched scalability methods. For future 

research, different types of applications, HTTP methods or REST API endpoints could be 

compared. 

 
For future research, the impact of Kubernetes cluster’s Ingress controller on scalability 

could be researched. The traffic coming into a cluster usually goes via the Ingress con-

troller. Future research could investigate if the Ingress controller can be scaled, and how 

it affects applications scalability. In addition, Service NodePort or LoadBalancer could be 

compared as alternatives to using an Ingress controller as the cluster gateway. 

 
68 

References 

A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, & J. Wilkes (2015). Large-

scale cluster management at Google with Borg. Proceedings of the Tenth European Con-

ference on Computer Systems (EuroSys '15). Association for Computing Machinery, New 

York, NY, USA, Article 18, 1–17. https://doi.org/10.1145/2741948.2741964 

 
BlazeMeter (2016). Advanced Load Testing Scenarios with JMeter Part 4 – Stepping 

Thread Group and Concurrency Thread Group. Retrieved 7.12.2021 from 

https://www.blazemeter.com/blog/advanced-load-testing-scenarios-jmeter-part-4-

stepping-thread-group-and-concurrency-thread 

 
BlazeMeter (2019). Performance Testing cs. Load Testing vs. Stress Testing. Retrieved 

7.12.2021 from https://www.blazemeter.com/blog/performance-testing-vs-load-test-

ing-vs-stress-testing 

 
Brazil (2018). Prometheus Up & Running. O’Reilly Media. ISBN 978-1292034148 

 
CNCF (2018). CNCF Cloud Native Definition v1.0. Retrieved 6.1.2021 from 

https://github.com/cncf/toc/blob/master/DEFINITION.md 

 
CNCF (2021a). Home Page. Retrieved 6.1.2021 from https://www.cncf.io 

 
CNCF (2021b). Graduated and Incubating Projects. Retrieved 6.1.2021 from 

https://www.cncf.io/projects/ 

 
Docker (2021a). What is a Container? Retrieved 18.4.2021 from 

https://www.docker.com/resources/what-container 

 
Docker (2021b). Best practices for writing Dockerfiles. Retrieved 6.9.2021 from 

https://docs.docker.com/develop/develop-images/dockerfile_best-practices 

https://github.com/cncf/toc/blob/master/DEFINITION.md
https://www.cncf.io/projects/
https://www.docker.com/resources/what-container


69 

 
D2iQ (2018). Brief History of Containers. Retrieved 18.4.2021 from 

https://d2iq.com/blog/brief-history-containers 

 
D. Merron & T. Idowu (2020). Introduction to Kubernetes Helm Charts. Retrieved 

17.10.2021 from https://www.bmc.com/blogs/kubernetes-helm-charts/ 

 
D. Sharma (2018). Response Time Based Balancing of Load in Web Server Clusters. 7th 

International Conference on Reliability, Infocom Technologies and Optimization (Trends 

and Future Directions) (ICRITO), 471-476, doi: 10.1109/ICRITO.2018.8748373. 

 
Erinle (2013). Performance Testing With JMeter 2.9. Packt Publishing, Limited.  

 
Github (2020). Components of Kubernetes. Retrieved 9.2.2022 from 

https://github.com/kubernetes/website/blob/main/static/images/docs/components-

of-kubernetes.png 

 
Github (2021a). Vertical Pod Autoscaler. Retrieved 1.11.2021 from 

https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler 

 
Github (2021b). Frequently Asked Questions. Retrieved 2.11.2021 from 

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md 

 
Github (2022). Cluster Autoscaler. Retrieved 28.2.2022 from https://github.com/kuber-

netes/autoscaler/blob/master/cluster-autoscaler/README.md 

 
Google Cloud (2022). What are containers used for? Retrieved 10.2.2022 from 

https://cloud.google.com/learn/what-are-containers 

 
https://d2iq.com/blog/brief-history-containers
https://www.bmc.com/blogs/kubernetes-helm-cha