Vähä, Eemil
An obstacle detection system for automated guided vehicles
[Subject]

Vaasa 2022
School of Technology and Innovations
Master's thesis in Automation and Computer Science
Energy and Information Technology

UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Vähä, Eemil
Title of thesis: An obstacle detection system for automated guided vehicles: [Subject]
Degree: Master of Technology
Subject: Automation and information technology
Supervisor: Jouni Lampinen
Graduation year: 2023
Number of pages: 111

Abstract:
The objective of this master's thesis is to investigate the utilization of computer vision and object detection as an integral part of the navigation system of an automated guided vehicle that operates within the facilities of the target company. The rationale for conducting this research and developing an application for this purpose arises from the inability of automated guided vehicles to detect smaller or partially obstructed objects, and from their inability to differentiate between stationary and moving objects. These limitations pose a safety hazard and negatively impact the overall performance of the system. The anticipated outcome of this thesis is a proof-of-concept computer vision application that would enhance the automated guided vehicles' obstacle detection capacity. The primary aim is to offer the target company practical insights into the implementation of computer vision by developing and training a YOLOv7 object detection model as a proposed resolution to the research problem. A thorough theoretical review of the technologies and tools required for training an object detection model is followed by a plan for the application that defines the requirements for the object detection model. The training and development are conducted using open-source and standard software tools and libraries.
Python is the primary programming language employed throughout the development process, and the object detector itself is a YOLOv7 (You Only Look Once) object detection algorithm. The model is trained to identify and classify a predetermined set of objects and obstacles that impede the present automated guided vehicle system. Model optimization follows a trial-and-error methodology, after which the best-performing model is tested in a simulated setting. The data required for training the object detection model is obtained by attaching a camera to an automated guided vehicle and capturing its movements within the target company's facilities. The gathered data is annotated using Label Studio, and all necessary data preparation and processing are carried out using plain Python. The result of this master's thesis was a proof of concept for a computer vision application that would improve and benefit the target company's day-to-day operations in their production and storage facilities in Vaasa. The trained model was shown to perform up to expectations in terms of both speed and accuracy. This project not only demonstrated the application's benefits but also laid the groundwork for the business to further utilize machine learning and computer vision in other areas, building on the operational improvement competency of the target company's employees. The results of this master's thesis showed that finding an optimal object detection model for a specific dataset within a reasonable timeframe requires both appropriate tools and a sufficient research knowledge base for model configuration. The trained model could be utilized as a foundation for similar projects, thereby reducing the time and costs involved in preliminary research efforts.
KEYWORDS: Computer vision, object detection, deep learning, Convolutional Neural Networks, Automated guided vehicle

UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Vähä, Eemil
Title of thesis: An obstacle detection system for automated guided vehicles: [Subject]
Degree: Master of Technology
Subject: Energy and information technology
Supervisor: Jouni Lampinen
Graduation year: 2023
Number of pages: 111

Abstract:
The purpose of this master's thesis is to investigate and utilize computer vision and object detection as part of the navigation system of an automated guided vehicle operating in the target company's facilities. The motivation for this study and for developing the application is the current automated guided vehicles' poor ability to detect smaller or otherwise indistinct objects and obstacles. The vehicles are also unable to distinguish moving objects from stationary ones, which poses a clear safety risk and degrades the performance of the system. The expected outcome of this work is a proof of concept for a computer vision application that could improve the vehicle's ability to detect obstacles along its route. The aim is to provide the target company with practice-based knowledge of applied computer vision by developing a YOLOv7 object detection model as a solution to the research problem. The thesis opens with a theoretical part on the technologies and techniques required to develop such an object detection model, after which a plan is drawn up for the application to define its requirements. The machine learning model is developed and the data processed using open-source and standard tools and libraries such as Jupyter and OpenCV. Python is the primary programming language. The object detector is the YOLOv7 object detection algorithm, which is trained to detect and identify defined obstacles that cause problems for the current system.
Model optimization is carried out by experimenting with different configurations, after which the best-performing model is tested in a simulated environment. The data required for developing the object detection model is collected by attaching a camera to the automated guided vehicles and filming their daily operation in the target company's facilities. The data was annotated with software called Label Studio, and all required data preparation and processing was carried out using the Python programming language. The result of this master's thesis was a proof of concept for a computer vision application whose deployment would advance the target company's daily operations in its current production and storage facilities. The developed model was found to perform sufficiently well against expectations in terms of both speed and accuracy. This project not only demonstrated the application's benefit to the current automated guided vehicle system but also, through the know-how gained, laid a foundation for the wider utilization of machine learning elsewhere in the business. The results of this thesis showed that finding an optimal object detection model within a reasonable time requires both appropriate tools and a sufficient research knowledge base for configuring the model. The trained model could be utilized as a foundation in other similar projects, thereby reducing the time and costs of preliminary research.
KEYWORDS: Computer vision, object detection, deep learning, Convolutional Neural Networks, Automated guided vehicle

Table of contents

1 Introduction 11
1.1 Objective 13
1.2 Structure 14
1.3 Project plan 15
2 Computer vision 17
2.1 Object detection 17
2.2 Artificial intelligence 19
2.3 Deep learning 21
2.4 How deep learning works 22
2.4.1 Supervised and unsupervised learning 22
2.4.2 Reinforcement learning and hybrid learning 23
2.4.3 Nodes 24
2.4.4 Linear function 24
2.4.5 Activation function 25
2.4.6 Loss and cost function 26
2.4.7 Forward propagation 27
2.4.8 Backpropagation 27
2.4.9 Epoch 27
2.4.10 Gradient descent 28
2.4.11 Stochastic gradient descent (SGD) 29
2.5 Convolutional Neural Networks (CNN) 30
2.5.1 Input layer 31
2.5.2 Convolutional layer 32
2.5.3 Feature maps 33
2.5.4 Pooling layer 33
2.5.5 Fully connected layer 34
2.5.6 Output layer 35
2.6 Overfitting and underfitting 36
2.6.1 Learning rate and optimizer 37
2.6.2 Dropout 37
2.6.3 Regularization 38
2.6.4 Data augmentation 39
2.6.5 Error analysis 40
2.7 Data collection 40
2.7.1 Datasets 41
3 Project plan 43
3.1 System features and requirements 43
3.1.1 Premises and functional requirements 43
3.1.2 Non-functional requirements 49
3.1.3 External interface requirements 49
3.1.4 Quality attributes 50
4 Implementation 51
4.1 Yolov7 52
4.1.1 Architecture 52
4.1.2 Loss function 54
4.1.3 Metrics used with YOLOv7 55
4.1.4 Data augmentation 56
4.2 Data collection 57
4.3 Data preparation 57
4.4 Label studio 58
4.5 Model training 59
4.5.1 Training preferences 60
4.5.2 Model optimization 61
4.5.3 Test dataset 67
5 Testing 73
5.1 Test requirements 74
5.2 Test cases 75
5.3 Test results 79
6 Results and observations 83
7 Conclusions 88
References 91
Appendices
Appendix 1. List of objects AGVs encounter when operating 1
Appendix 2. Python commands 1
Appendix 3. Hyperparameters of the tenth model 1
Appendix 4. Training results 1
Appendix 5. Test cases 1
Appendix 6.
Dependencies 1

Pictures

Picture 1 Communication between the EWM system, AGV PC, and the AGVs 11
Picture 2 Visualization of the AGVs safety scanners and navigation system 12
Picture 3 Figure illustrating basic principles of object detection (Dadhich, 2018) 18
Picture 4 Relations between the terms artificial intelligence, machine learning, and deep learning (IBM, 2022) 20
Picture 5 General structure of a deep neural network (IBM, 2022) 23
Picture 6 An overview of a node in a neural network (Chokmani, Khalil, Ouarda, & Bourdages, 2007) 24
Picture 7 Gradient descent (Haji & Abdulazeez, 2021) 28
Picture 8 Characteristics of a cost function (Goodfellow, Bengio, & Courville, 2018) 29
Picture 9 Architecture of LeNet-5, a Convolutional Neural Network (LeCun, Bottou, Bengio, & Haffner, 1998) 31
Picture 10 Convolution between an image data matrix and kernel 32
Picture 11 Max pooling 34
Picture 12 A visualization of a fully connected layer (Kost, Altabey, Noori, & Taher, 2019) 35
Picture 13 Graphs visualizing underfitting, balanced fit, and overfitting of a machine learning model (Amazon, 2023) 36
Picture 14 A neural net before and after applying dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) 38
Picture 15 Data augmentation achieved with geometric deformation (Yorioka, Kang, & Iwamura, 2020) 39
Picture 16 Example of a training batch with mosaic data augmentation used to train the model 56
Picture 17 Label Studio user interface 59
Picture 18 Example of a linear motion labelling sequence in Label Studio 59
Picture 19 The horizontal black solid line represents the limit that an object cannot surpass, and the vertical dashed lines illustrate the width of the AGV. A pallet-jack is classified with a confidence of 95%. 75
Picture 20 Test cases for pallet-jack. 76
Picture 21 Test cases for rider-truck. 77
Picture 22 Test cases for reach-truck. 78
Picture 23 Test cases for counterbalance-truck. 79
Picture 24 Example of a training batch used to train model number ten. 6
Picture 25 Test case 1 for pallet-jack. 1
Picture 26 Test case 2 for pallet-jack. 1
Picture 27 Test case 1 for rider-truck. 2
Picture 28 Test case 2 for rider-truck. 2
Picture 29 Test case 1 for reach-truck. 3
Picture 30 Test case 2 for reach-truck. 3
Picture 31 Test case 1 for counterweight-truck. 4
Picture 32 Test case 2 for counterweight-truck. 4

Figures

Figure 1 Project plan in the form of a Gantt chart 16
Figure 2 Process flowchart to determine detectable objects 45
Figure 3 Braking distance-speed graph for soft stop 46
Figure 4 Braking distance-speed graph for emergency stop 47
Figure 5 Development process 51
Figure 6 Yolov3 architecture with introduction to PP-Yolo features (Xiang et al., 2020) 53
Figure 7 Extended efficient layer aggregation network (Wang, Bochkovskiy, & Mark Liao, 2022) 53
Figure 8 Results of second training run. 63
Figure 9 F1 curve after the fourth training run. 64
Figure 10 Confusion matrix of training run six. 64
Figure 11 Results of the eighth training run with learning rate and lower mosaic parameter. 65
Figure 12 The tenth model's performance on validation and test dataset and average performance of all models. 68
Figure 13 Distance estimation between an object and the AGV. 75
Figure 14 Number of false detections in frames by confidence threshold and object class. 82
Figure 15 Results of the third training run. 1
Figure 16 Results of the fifth training run with a larger dataset. 1
Figure 17 F1 score of the fifth training run. 2
Figure 18 Results of the seventh training run with lower mosaic hyperparameter. 2
Figure 19 Confusion matrix after the seventh training run. 3
Figure 20 Results of the ninth training run with lower learning rate and initial mosaic parameter. 3
Figure 21 Confusion matrix of the ninth training run. 4
Figure 22 Results of the tenth training run with adjusted augmentation parameters. 4
Figure 23 Confusion matrix of the tenth training run. 5

Tables

Table 1 Popular activation functions (Pragati, 2023) 26
Table 2 List of objects that the model is trained to detect and classify. 45
Table 3 First dataset used for training. 60
Table 4 Second dataset used for training. 61
Table 5 Final dataset used for training. 61
Table 6 All training runs with respective parameters. 62
Table 7 Overall performance of each model. 66
Table 8 Test results with the highest scores highlighted with respect to the metric. 70
Table 9 Summary table of the trained model. 72
Table 10 Test results by class and evaluation criteria. 80
Table 11 Number of frames it took for the model to detect and classify the objects by test case. 80
Table 12 Results for testing the number of false detections in frames. 81
Table 13 False detections by duration. 81

Abbreviations

AGV Automated guided vehicle
ANN Artificial neural network
AP Average Precision
CNN Convolutional neural network
ERP Enterprise resource planning system
EWM Extended warehouse management
FN False Negative
FP False Positive
FPS Frames Per Second
IoU Intersection over Union
mAP mean Average Precision
OIC Operational improvement competency
STH Sustainable technology hub
TP True Positive
YOLO You Only Look Once

1 Introduction

In 2022, the construction phase of Wärtsilä's forthcoming logistics centre and factory in Vaasa reached completion, commencing the implementation phase of a new logistical operations model.
This model comprises a new extended warehouse management (EWM) system and an integrated automated guided vehicle (AGV) system. The fundamental concept underlying this model is that the EWM system assigns tasks to the AGVs, which subsequently execute these tasks as instructed. The AGV fleet consists of six automated vehicles, all of which perform identical tasks, namely the transportation of euro pallets from location A to location B. These tasks, commonly known as "warehouse tasks", entail specific instructions for the pickup and delivery of euro pallets.

Picture 1 Communication between the EWM system, AGV PC, and the AGVs

The AGVs use laser navigation to navigate the warehouse and factory. A real-time route calculation is done based on the locations of reflectors installed in the building, which a spinning laser beam on top of the AGV detects (Solving, 2022). This means that the AGVs always use a predetermined fixed path and do not deviate from it independently.

The AGVs are equipped with laser scanners and bumpers to detect obstacles. Each AGV features three laser scanners in the front and one in the back. One of the laser scanners is positioned on top of the AGV and is angled 45 degrees towards the ground. The remaining two front scanners and the one at the back are located approximately 100 mm above ground level on the AGV's lower section. These scanners operate parallel to the ground, thereby creating a two-dimensional plane encompassing the AGV.

Picture 2 Visualization of the AGVs safety scanners and navigation system

If the laser scanners detect an obstacle, the AGV slows down and stops if the obstacle is not removed. However, the AGVs cannot detect objects lying on the floor that are under 100 mm high. Neither are they able to detect obstacles suspended horizontally in the air, e.g., the forks of a forklift. This does, of course, cause a safety risk and is currently avoidable only by precautionary measures.
In practice, this would mean that somebody would have to ensure that the AGV paths are clear of all possibly undetectable objects at all times. This differs from the planned course of action, since the presumption is that the AGVs would be able to operate independently without supervision.

1.1 Objective

The purpose of this master's thesis is to create a proof of concept for a computer vision application by training an object detection model that could be utilized to improve the current AGV system's performance in terms of avoiding collisions with obstacles. The general idea is that a camera is placed on top of an AGV, facing the direction the AGVs move when operating. Video data from the camera would then be fed to an embedded system capable of running the required object detection model. The embedded system would analyse the data and signal the AGV to stop if an object is detected in its path based on a predefined set of criteria. This master's thesis is conducted as an assignment for Wärtsilä in return for compensation that is paid as a lump sum after the thesis is finalized.

All possible variables must be considered in order to successfully train an object detection model that meets all the requirements that an application of this nature sets. The nature of the dataset used to train the model depends not only on the observed objects but also on noise levels, footage quality, and lighting intensity. The number of ways a computer vision application could be developed is vast, considering the number of available open-source object detectors and data processing techniques. However, most of them might not give the desired output because they do not fit the nature of the dataset in question. Therefore, preliminary research on the topic is fundamental (Davies, 2017).

Nowadays, computer vision is utilized widely in different applications.
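The stop criterion described above can be expressed as a simple decision rule over a model's detections. The sketch below is purely illustrative and not the thesis's actual implementation: the `Detection` fields, the confidence threshold, the path corridor, and the stop line are hypothetical placeholders for values that would have to be derived from the AGV's real geometry and braking distances.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # predicted object class, e.g. "pallet-jack"
    confidence: float   # model confidence in [0, 1]
    x_center: float     # bounding-box centre, normalized image x-coordinate
    y_bottom: float     # bottom edge of the box; a lower edge means a closer object

# Hypothetical thresholds for illustration only.
CONF_THRESHOLD = 0.5
PATH_LEFT, PATH_RIGHT = 0.35, 0.65   # corridor matching the AGV's projected width
STOP_LINE = 0.8                      # objects below this line are considered too close

def should_stop(detections):
    """Return True if any confident detection lies inside the AGV's
    projected path and has crossed the stop line."""
    for d in detections:
        if (d.confidence >= CONF_THRESHOLD
                and PATH_LEFT <= d.x_center <= PATH_RIGHT
                and d.y_bottom >= STOP_LINE):
            return True
    return False
```

This mirrors the criteria illustrated later in Picture 19, where a horizontal line marks the limit an object may not surpass and vertical dashed lines mark the AGV's width.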
Recent developments in deep learning techniques have increased the accuracy of applications like object detection, tracking, and image classification, which has triggered the development of more complicated autonomous systems like drones, self-driving automobiles, and humanoid robots (Dadhich, 2018). In addition, the constantly growing number of computer vision applications, the research papers and books written about them, and the available open-source datasets all support the work required to conduct this master's thesis.

To showcase an example of a valuable reference for this master's thesis, the Technical University of Munich claims to have put together the first publicly available annotated dataset with a logistics-related focus. This dataset consists of over 39 thousand images of logistics-specific objects such as pallets, small load carriers, stillages, forklifts, and pallet trucks, which are common objects within the logistics domain (Mayershofer, Holm, Molter, & Fottner, 2022). This dataset could be utilized, e.g., to pre-train or test an object detection model. In addition, there are over one thousand search results for conference papers and books in the IEEE Xplore digital library with "computer vision" in the document title within the last five years.

The main objective of this thesis is to provide practical knowledge about the usability of computer vision in Wärtsilä's business operations in the form of a proof of concept for a computer vision application. The gained knowledge helps further develop the application and possibly utilize the same technology in other business areas. There is much to improve in both the safety and efficiency of the automated guided vehicle system, and that is what the outcome of this master's thesis should primarily address. This master's thesis aims to answer the following research question.
"Can object detection be utilized to improve the current automated guided vehicle system's performance in terms of avoiding crashing into obstacles it cannot detect with its current scanners?"

1.2 Structure

This thesis can be divided into four main sections. First, the introduction section, comprising chapter 1, defines the current problem, narrowing the scope of this thesis down to the specific research question that the study seeks to answer. The following section is the methods section, comprising chapters 2 and 3, where all relevant theory related to the topic of this master's thesis is explained. This helps the reader to better understand the methods used to carry out the training of an object detection model. The primary body of the research consists of the methods and results sections, which span the remaining chapters. The implementation section, comprising chapters 4 and 5, outlines the actual development carried out during the research, which is then discussed in the last section of this thesis. Finally, the discussion section, comprising chapters 6 and 7, discusses the study's outcome and presents its impact on the current problem as well as the broader implications of the research.

1.3 Project plan

To provide a clear and comprehensive overview of the timeframe for conducting this master's thesis, a Gantt chart has been utilized and is presented in Figure 1. The Gantt chart displays all significant phases involved in the process of conducting this master's thesis. This chart serves as a visual aid, illustrating the entire duration of the project and enabling stakeholders to understand the various stages involved. By using the Gantt chart, the timeline for conducting this master's thesis can be effectively monitored, ensuring that the project remains on track and within the designated time frame.
Figure 1 Project plan in the form of a Gantt chart

2 Computer vision

For the reader to properly understand the general concepts and methods related to computer vision and object detection, a sufficient review of the topic is fundamental. Computer vision is a broad term for methods and underlying theories related to image processing. The definition of what distinguishes computer vision from image processing, pattern recognition, or machine vision varies, and the boundary between them remains unclear (Fisher et al., 2014). The book "Dictionary of Computer Vision and Image Processing" (2014) contains over three and a half thousand terms related to computer vision and image processing, and this does not even cover the most recent terms related to the latest developments in deep learning techniques. This chapter, however, only explains the terms, methods, techniques, and related theory affiliated with this master's thesis.

When discussing practical applications capable of detecting and identifying objects from image and video data, the underlying technology is usually computer vision. Computer vision can be utilized in various applications like image classification, object detection, object tracking, image geometry, image segmentation, and image generation (Dadhich, 2018). The primary emphasis of this master's thesis is on object detection, as well as the related techniques and tools.

2.1 Object detection

In short, object detection is about localizing objects in images and identifying what each object is based on predefined specifications. In a generic image recognition model, an image is passed through the model, which outputs the class name of the object in the image, the object's centre pixel, a bounding box surrounding the object, and a class for each pixel in the image. The first and third are usually affiliated with computer vision (Dadhich, 2018).
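Two of the outputs just mentioned, the bounding box and the object's centre, also underpin evaluation metrics that appear later in the thesis, notably Intersection over Union (IoU). A minimal, self-contained sketch of both, assuming boxes are given as (x1, y1, x2, y2) pixel corners (a representation chosen here for illustration, not taken from the thesis):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the overlapping region (which may be empty)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def box_center(box):
    """Centre point of a bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)
```

An IoU of 1.0 means a predicted box matches the ground-truth box exactly, while 0.0 means the boxes do not overlap at all.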
Picture 3 Figure illustrating basic principles of object detection (Dadhich, 2018)

Achieving vision artificially is a challenge on its own. The problem is complex because it is an inverse problem, where the available data is inadequate to fully specify the solution. In computer vision, this problem is tackled by applying machine learning to clarify the differences between possible solutions, i.e., trying to interpret the visual world from images and reconstructing properties such as shape, illumination, and colour distributions (Szeliski, 2022).

Modelling the real world is nearly impossible, considering all the possible challenges encountered when developing a computer vision application in the ever-changing, complex real world. Common challenges that need to be acknowledged when working with an object detection application are occlusion, viewpoint changes, variation in size, non-rigid objects, and motion blur, especially in an environment where these change a lot (Dadhich, 2018). These are all discussed in more detail in chapter 4, where the actual model is trained.

Object detection is just one computer vision technique that helps, e.g., detect objects in an image. However, there are many variations of object detection models with different architectures that suit certain types of applications. In order to find the best-fitting and best-performing network for a specific application, comprehensive research on the topic is fundamental (Zhong-Qiu, Peng, Shou-Tao, & Xindong, 2019). The object detection algorithm used to train the model in this master's thesis is YOLOv7. A more detailed explanation of YOLOv7 is laid out in chapter 4.1.

2.2 Artificial intelligence

The term artificial intelligence is somewhat elusive. However, it is beneficial to understand the relations between the different concepts to get a clear overview and understanding of the technologies used in this thesis.
One of the founding fathers of the term "artificial intelligence", John McCarthy (2007, p. 2), defines it the following way in his paper "What is artificial intelligence?":

It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.

Artificial intelligence is a field that, in its simplest form, combines computer science and substantial datasets to facilitate problem-solving. Additionally, it includes the subfields known as machine learning and deep learning, which are commonly addressed together. These fields use AI algorithms to build intelligent systems that predict or categorize information based on incoming data (IBM, 2022).

Picture 4 Relations between the terms artificial intelligence, machine learning, and deep learning (IBM, 2022)

Machine learning and deep learning are both subfields of artificial intelligence. The key difference between them lies in how learning is accomplished. Machine learning is essentially artificial intelligence that is more dependent on human intervention when it comes to learning; it makes more linear and simple correlations and is usually trained on smaller datasets. Deep learning, in contrast, is a subfield of machine learning that relies on supervised learning with artificial neural networks (IBM, 2022). Deep learning is essentially a mathematically more complex evolution of traditional machine learning algorithms. Both use the same learning techniques, supervised as well as unsupervised, but deep learning utilizes a more complex layered structure of algorithms known as artificial neural networks.
A typical deep learning architecture is the convolutional neural network, which includes a fully connected neural network and is capable of learning complex non-linear information from large datasets, thereby prospering in the field of image classification and proving useful in computer vision (Campesato, 2020).

In general, deep learning models outperform traditional machine learning methods as the learning dataset grows and are therefore used in applications with large amounts of data. It is to be noted that deep learning requires much more data than traditional machine learning due to the complex multilayer structure deep learning utilizes in order to make accurate inferences (Amitha;Amudha;& Sivakumari, 2020).

2.3 Deep learning

Deep learning or deep models, also referred to as neural networks, are a core element of computer vision-based object detection. Neural networks date back to the 1940s, when they were used to simulate the human brain system. The popularity of neural networks increased in the 1980s and the 1990s when a back-propagation algorithm was proposed in a journal article, but interest eventually faded due to a lack of computational resources (Zhong-Qiu;Peng;Shou-Tao;& Xindong, 2019). Deep learning comprises different types of neural network architectures such as the convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU) (Campesato, 2020). Convolutional neural networks are the most relevant of these architectures, considering the topic of this master's thesis.

From a more general high-level perspective, deep learning with supervised learning consists of defining the neural network, estimating data points, i.e., labelling the dataset, calculating the loss or error of each estimate, and reducing the error via gradient descent (Campesato, 2020). These steps are explained in this chapter, including a general overview of the CNN architecture.
In addition to the well-established deep learning architectures, there exist several relatively recent advancements, including input convex neural networks (ICNN) (Amos;Xu;& Kolter, 2017) and global optimization networks (GON) (Zhao;Louidor;& Gupta, 2022). These architectures represent cutting-edge developments in the field of deep learning and have shown significant promise in improving the performance and accuracy of object detection systems.

2.4 How deep learning works

The general structure of a deep learning neural network requires at least two hidden layers, whereas very deep learning would require at least ten hidden layers. These layers consist of connected nodes, often referred to as neurons, each of which builds on the one before it to improve and refine the prediction or classification. The number of hidden layers can be adjusted in the training process with a parameter known as a hyperparameter (Davies, 2017). In general, deep neural networks are trained with supervised learning, unsupervised learning, reinforcement learning, as well as hybrid learning (Amitha;Amudha;& Sivakumari, 2020).

2.4.1 Supervised and unsupervised learning

Supervised and unsupervised learning are both machine learning techniques. In supervised learning, the task is to predict an output y given an input x based on a training dataset. That is, supervised learning uses labelled datasets to train the machine learning model or neural network. Unsupervised learning relies on detecting and identifying patterns or structures in data with algorithms, without human interaction (Fisher, et al., 2014). Generally, deep learning relies on supervised learning in object detection and classification-related tasks (Goodfellow, Bengio, & Courville, 2018).
Picture 5 General structure of a deep neural network (IBM, 2022)

2.4.2 Reinforcement learning and hybrid learning

Reinforcement learning is a machine learning method that is based on rewarding actions that have a positive impact and punishing actions that have a negative impact on the environment. Essentially, the goal is to maximize the total reward of the agent making the actions based on the state of the environment. The agent learns in an interactive environment by trial and error with the help of the feedback from its own actions (Amitha;Amudha;& Sivakumari, 2020). Hybrid learning, on the other hand, is a deep neural network that combines different architectures and utilizes generative (unsupervised) as well as discriminative (supervised) components to learn. In general, generative models can generate new data instances, whereas discriminative models only discriminate between different kinds of data inputs (Kuleshov & Ermon, 2017).

2.4.3 Nodes

A node is the smallest building element of a neural network and is responsible for adjusting the variable values processed by the network. Most of the nodes or neurons in the network consist of two different functions: a linear function and an activation function (Bernico, 2018). The number of neurons is also adjustable with a hyperparameter and is generally part of the initial set-up phase of a neural network (Campesato, 2020).

Picture 6 An overview of a node in a neural network (Chokmani;Khalil;Ouarda;& Bourdages, 2007)

2.4.4 Linear function

The linear function is essentially a linear regression function that calculates a weighted sum based on the inputs by multiplying each input with a coefficient, also known as a weight. The final output z of a linear function is calculated with function 1, where {x1, x2, . . . , xn} are the inputs and {w1, w2, . . . , wn} are the weights or coefficients.
The values of the weighted sums are calculated in each neuron in a layer, after which the same process is repeated in the next layer (Bernico, 2018).

z = x1w1 + x2w2 + … + xnwn (1)

2.4.5 Activation function

After calculating the weighted sum with a linear function, the value is fed to the following function, called an activation function. An activation function is a non-linear function used in neural networks to introduce non-linearity to the model. A neural network without activation functions would essentially be a linear regression model performing a linear transformation on the inputs of the neurons (Bernico, 2018). A few examples of common activation functions used in machine learning and deep learning are Sigmoid, TanH, and ReLU, each with properties suitable for different types of problems. The difference between these activation functions is the function itself, which produces a specific output range (Pomerat;Segev;& Datta, 2019).

Activation function | Function | Output range
Sigmoid | sigmoid(x) = 1 / (1 + e^(-x)) | [0, 1]
TanH | tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) | [-1, 1]
ReLU | f(x) = max(0, x) | [0, ∞)
Binary Step | f(x) = 0 if x < 0; 1 if x ≥ 0 | {0, 1}
Linear | f(x) = x | (-∞, ∞)
Leaky ReLU | f(x) = max(0.1x, x) | (-∞, ∞)
Parametric ReLU | f(x) = max(ax, x) | (-∞, ∞)
ELU | f(x) = x if x ≥ 0; α(e^x - 1) if x < 0 | [-α, ∞)
Softmax | softmax(zi) = exp(zi) / Σj exp(zj) | [0, 1]
Swish | f(x) = x · sigmoid(x) | ≈ [-0.28, ∞)
GELU | f(x) = 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))) | ≈ [-0.17, ∞)
SELU | f(α, x) = λα(e^x - 1) if x < 0; λx if x ≥ 0 | [-λα, ∞)

Table 1 Popular activation functions (Pragati, 2023)

2.4.6 Loss and cost function

A loss function measures how well a neural network predicts the correct output for a given input by quantifying the difference between the expected and actual output; minimizing this difference allows the network to learn to predict more accurately. The cost function, in turn, is a measure of the network's overall error and is calculated by averaging the loss function across all training examples.
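To make the node computation concrete, the sketch below combines the weighted sum of function 1 with the sigmoid activation from table 1. The function name and the example values are illustrative, not part of the thesis application:

```python
import math

def node_output(inputs, weights):
    """One neuron: the weighted sum of function 1 fed into a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights))  # z = x1*w1 + x2*w2 + ... + xn*wn
    return 1.0 / (1.0 + math.exp(-z))                # sigmoid(z), output in (0, 1)

# Example: a neuron with two inputs
print(node_output([0.5, -1.0], [0.8, 0.2]))  # z = 0.2, sigmoid(0.2) ≈ 0.55
```

The same computation is repeated for every neuron in a layer before the outputs are passed on to the next layer.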
The goal is to find the weights and biases that minimize the cost function, which is achieved with an optimizer in deep learning. A model is said to converge when the loss settles within an error range around the final value, i.e., when training no longer improves the model's performance (Yingjie;Duo;Stanislao;& Xiaohui, 2022). A cross-entropy-based loss function is one of the most common types of loss functions used in deep neural networks, especially in classification-related tasks such as object detection. Cross-entropy loss is calculated with equation 2, where gj is the discrete ground-truth label of class j, and pj is the output of the output layer of the network, also known as the softmax layer (Yingjie;Duo;Stanislao;& Xiaohui, 2022).

L_CE = -Σj gj log(pj) (2)

2.4.7 Forward propagation

Forward propagation is the process in deep learning where the input data is processed within the neural network, starting from the input layer and ending at the output layer, passing through multiple hidden layers if present. During forward propagation, the input data is transformed through a series of computations, also known as "forward computations", in each layer of the network until it reaches the output layer, where the result or prediction is produced. The outputs are based on the weights of the linear functions in the neurons of each layer, as well as the activation function. The forward propagation process is used to predict the desired target variable based on the input data and the parameters of the trained neural network (Bernico, 2018).

2.4.8 Backpropagation

Deep neural networks gained popularity in the 1980s and 1990s when the concept of backpropagation was presented by Rumelhart et al. in a Nature article (Rumelhart;Hinton;& Williams, 1986). The actual training of the neural network happens when the weights of the neurons are updated based on the network's error calculated with the cost function.
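As a minimal sketch, the cross-entropy loss of equation 2 can be evaluated for a single prediction; the one-hot ground truth and the example probabilities are illustrative:

```python
import math

def cross_entropy(ground_truth, predicted):
    """Equation 2: L_CE = -sum_j g_j * log(p_j) for one training sample."""
    return -sum(g * math.log(p) for g, p in zip(ground_truth, predicted) if g > 0)

g = [0.0, 1.0, 0.0]  # discrete one-hot ground-truth label: class 1 of 3
p = [0.1, 0.7, 0.2]  # softmax output of the network
print(cross_entropy(g, p))  # -log(0.7) ≈ 0.357
```

The loss shrinks towards zero as the predicted probability of the correct class approaches one.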
The goal of backpropagation is to adjust the weights of the neurons in the neural network so that the network makes more accurate predictions. This is accomplished with a backpropagation process that calculates the gradient using the chain rule of calculus. However, backpropagation refers only to the operation of calculating the gradient, which is then used to train the network with an optimizer, such as stochastic gradient descent (Goodfellow;Bengio;& Courville, 2018).

2.4.9 Epoch

Each round in the training process is called an epoch. During each epoch, the entire training dataset is processed, meaning that the model's parameters, such as weights, are updated based on the backpropagation process. The number of epochs eventually determines how long a model is trained, and it can be adjusted with a hyperparameter during the training process (Campesato, 2020).

2.4.10 Gradient descent

Deep learning usually introduces some form of optimization. Optimization refers to the task of either minimizing or maximizing some function by altering some variable. In this case, the cost function is the function to be minimized, and the weights of the nodes in the network are the variables altered (Goodfellow;Bengio;& Courville, 2018). In deep learning, an optimizer is an algorithm used to minimize the loss function, and the two most common optimizers are gradient descent and stochastic gradient descent (Pomerat;Segev;& Datta, 2019).

Picture 7 Gradient descent (Haji & Abdulazeez, 2021)

The principle of gradient descent is to take repeated steps in the opposite direction of the gradient at a specific point of the cost function. This is considered the steepest descent direction of the function. Taking a step in the other direction instead leads towards a local maximum and is referred to as gradient ascent (Goodfellow;Bengio;& Courville, 2018).
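The stepping principle of gradient descent can be sketched for a one-dimensional cost function; the quadratic cost and the chosen learning rate are illustrative only:

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to descend towards a minimum."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)  # step in the steepest-descent direction
    return x

# Minimize the cost J(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # converges towards the global minimum at x = 3
```

With a multi-dimensional cost function, the same update is applied to every weight simultaneously using the gradient computed by backpropagation.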
Picture 8 Characteristics of a cost function (Goodfellow;Bengio;& Courville, 2018)

The size of these steps is called the learning rate (α). With a large learning rate, the algorithm covers more ground in each step but has a chance of overshooting and completely missing the desired global minimum. With a low learning rate, this risk is negligible; however, the process is more time-consuming and has the risk of getting stuck in a local minimum of the function. The learning rate can also be adjusted with a hyperparameter during training and is often used to stabilize the process. The most common learning rates used in deep learning are 0.001, 0.003, 0.01, 0.03, 0.1, and 0.3 (Haji & Abdulazeez, 2021).

2.4.11 Stochastic gradient descent (SGD)

One of the most common optimizers or learning methods in deep learning and machine learning is stochastic gradient descent (SGD), also known as the online update. Stochastic gradient descent converges faster than regular gradient descent (LeCun;Bottou;Bengio;& Haffner, 1998). In short, gradient descent is an iterative algorithm that descends a function's slope in steps from a random point until it reaches its lowest point. The learning rate is an important parameter that has an impact on the convergence of the algorithm, since it determines the size of each step, and it is usually chosen by trial and error rather than based on any general guidance (Goodfellow;Bengio;& Courville, 2018). Stochastic gradient descent modifies the network's configuration after each training point in an effort to locate the global minimum. Instead of reducing the error or determining the gradient for the complete dataset, the gradient is calculated based on a randomly chosen batch of samples, which could be as small as a single sample. In practice, this is achieved by randomly shuffling the dataset and moving through the batches in steps, making it considerably faster and computationally more efficient than regular gradient descent.
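The shuffle-and-batch idea can be sketched with a toy one-weight model; the mean-squared-error gradient, the batch size, and the learning rate here are illustrative choices, not the configuration used in this thesis:

```python
import random

def sgd(data, grad_fn, w, learning_rate=0.05, batch_size=4, epochs=50):
    """Shuffle the dataset each epoch and update the weight after every
    randomly chosen batch instead of after the full dataset."""
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            w -= learning_rate * grad_fn(w, batch)
    return w

def mse_grad(w, batch):
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Toy dataset on the line y = 2x; SGD should recover w ≈ 2
points = [(i / 10, 2 * i / 10) for i in range(1, 21)]
print(sgd(points, mse_grad, w=0.0))
```

Because each update touches only a small batch, the cost per step stays constant even when the dataset grows.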
SGD is, therefore, suitable for large-scale datasets (Bottou, 2018). Other optimizers used to optimize neural networks in deep learning are Momentum, Nesterov Momentum, Adaptive Gradient Descent (AdaGrad), Adaptive Delta (AdaDelta), Root Mean Square Propagation (RMSProp), Adaptive Moment Estimation (Adam), and Maximum Adaptive Moment Estimation (AdaMax), each having its distinguishing features (Haji & Abdulazeez, 2021). However, explaining all of them is beyond the scope of this thesis. The critical takeaway is to understand that optimization algorithms are the backbone of the learning process in deep learning neural networks (Goodfellow;Bengio;& Courville, 2018). Other widely used deep learning methods are learning rate decay, dropout, max-pooling, batch normalization, skip-gram, and transfer learning, each used to optimize the model, solve different problems, and reduce training time. E.g., dropout is explained in more detail in chapter 2.6 as a method to address overfitting (Amitha;Amudha;& Sivakumari, 2020).

2.5 Convolutional Neural Networks (CNN)

Recent computer vision advances have made deep learning a noteworthy and advantageous asset that consumers utilize ever more often, especially in the commercial sector, where convolutional neural networks were used to solve critical commercial applications such as AT&T's optical character recognition (OCR) application for reading checks in 1998, a solution that has remained in use up to today (Goodfellow;Bengio;& Courville, 2018). Convolutional neural networks are a very versatile and, in theoretical terms, relatively straightforward model, yet highly applicable to various perceptual tasks such as object detection.
Convolutional neural networks are a trainable neural network architecture capable of learning invariant features from large sets of labelled data, using an architecture composed of stages consisting of a filter bank layer, a non-linearity layer, and a feature pooling layer (LeCun;Kavukcuoglu;& Farabet, 2010).

Picture 9 Architecture of LeNet-5, a convolutional neural network (LeCun;Bottou;Bengio;& Haffner, 1998)

This chapter explains all the relevant terms and concepts related to a convolutional neural network so that the reader has a general idea of how convolutional neural networks work and how they are utilized in this master's thesis.

2.5.1 Input layer

Convolutional neural networks are neural networks specialized in processing data with a grid-like topology, such as image data. The input layer is the first layer and the input of the whole convolutional neural network. In the context of a neural network applied to image processing, the input data is a pixel matrix of an image (Zhang;Wang;Zhang;Xu;& Chen, 2019).

2.5.2 Convolutional layer

Instead of general matrix multiplication, convolutional neural networks employ a mathematical operation called convolution in at least one layer of the architecture (Goodfellow;Bengio;& Courville, 2018). In the context of neural networks, convolution can be considered a mathematical operation that combines two sequences or sets of numbers to produce a third sequence. The convolution operation on a 2D image is defined as the weighted sum of the components in a small window, known as a kernel or filter, moving over the image. These weights are learned parameters modified in the kernel during training to detect features in the image such as edges, lines, and corners (Zhang;Wang;Zhang;Xu;& Chen, 2019). This is employed in the convolutional layer, the second layer of a CNN architecture.
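The sliding-window weighted sum described above can be sketched in plain Python; the tiny image and the Sobel-style edge kernel are illustrative:

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and compute the weighted sum of each
    window ("valid" mode: no padding at the borders)."""
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + m][j + n] * kernel[m][n]
                           for m in range(kh) for n in range(kw)))
        feature_map.append(row)
    return feature_map

# A vertical edge kernel responds strongly where pixel values change
sobel = [[-1, 0, 1],
         [-2, 0, 2],
         [-1, 0, 1]]
image = [[0, 0, 255, 255, 255],   # dark-to-bright vertical edge
         [0, 0, 255, 255, 255],
         [0, 0, 255, 255, 255],
         [0, 0, 255, 255, 255]]
print(convolve2d(image, sobel))  # [[1020, 1020, 0], [1020, 1020, 0]]
```

Note how the output is large near the edge and zero in the flat bright region; in a CNN the kernel weights are not fixed like this but learned during training.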
The output of the convolutional layer is often referred to as a feature map containing real numbers (Campesato, 2020).

Picture 10 Convolution between an image data matrix and a kernel

Depending on the nature of the computer vision application, image processing can improve the model's performance significantly. Image processing is therefore a crucial part of training a deep learning model for a computer vision application. This is achieved by applying filters or kernels to the original image. The purpose of these filters is to process the image, e.g., remove noise, blur the image, extract edges, remove objects, or highlight objects (Nixon & Aguado, 2020). For example, the kernel in picture 10 is called a Sobel kernel, usually used to highlight edges in an image (Sanida;Sideris;& Dasygenis, 2020).

2.5.3 Feature maps

During the second stage of the object detection process, feature maps generated by the convolutional layer are processed through non-linear activation functions, typically a Rectified Linear Unit (ReLU) activation function. The ReLU activation function sets negative values in the feature maps to zero, introducing non-linearity into the network. This helps the network identify and highlight important object features, ultimately improving the precision and effectiveness of the object detection system (Campesato, 2020).

2.5.4 Pooling layer

The third stage of convolutional neural networks is the pooling layer, whose role is to merge semantically similar features from the output of the previous layer into one (LeCun;Bengio;& Geoffrey, 2015). This is achieved with a pooling function that replaces the output of a feature map with a statistical summary of the neighbouring output values.
There are several pooling functions used in convolutional neural networks, such as max pooling, the average of a rectangular neighbourhood, the L2 norm of a rectangular neighbourhood, or a weighted average based on the distance from the central pixel (Goodfellow;Bengio;& Courville, 2018).

Picture 11 Max pooling

A typical pooling function is max pooling, which is performed by applying a max filter to non-overlapping sub-regions of the original output, i.e., taking the maximum value of these regions. Applying any kind of pooling makes the pooled output invariant to smaller changes in the input image (LeCun;Kavukcuoglu;& Farabet, 2010).

2.5.5 Fully connected layer

The fourth layer of a convolutional neural network is the fully connected layer, which combines the information of the previous layers (Zhang;Wang;Zhang;Xu;& Chen, 2019). In the fully connected layer, also known as the dense layer, all possible connections from layer to layer are present, meaning that every input of the input vector affects every output in the output vector. The number of fully connected layers and neurons in these layers varies depending on the architecture (Arora, Garg, & Gupta, 2020).

Picture 12 A visualization of a fully connected layer (Kost;Altabey;Noori;& Taher, 2019)

The fully connected layer represents the neural network part of a convolutional neural network, where deep learning-related methods such as backpropagation are applied (Ramsundar & Zadeh, 2018). These were discussed in chapter 2.4.

2.5.6 Output layer

The last layer of a typical convolutional neural network architecture is a fully connected SoftMax layer that computes the network's output. The output is essentially a probability score for each defined class (Zhang;Wang;Zhang;Xu;& Chen, 2019).
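Returning briefly to the pooling layer of chapter 2.5.4, the max pooling operation of picture 11 can be sketched over non-overlapping 2×2 regions; the feature-map values are illustrative:

```python
def max_pool(feature_map, size=2):
    """Apply a max filter to non-overlapping size x size sub-regions."""
    pooled = []
    for i in range(0, len(feature_map), size):
        row = []
        for j in range(0, len(feature_map[0]), size):
            row.append(max(feature_map[i + m][j + n]
                           for m in range(size) for n in range(size)))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 9, 8],
        [1, 1, 7, 5]]
print(max_pool(fmap))  # [[6, 2], [2, 9]]
```

Each 2×2 region is reduced to its single strongest activation, which is what makes the pooled output insensitive to small shifts in the input.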
The SoftMax function used to calculate the probability is a mathematical function that converts the last layer of numbers in the CNN into a vector of probabilities, where the probability of each value is proportional to the relative scale of each value in the vector. The probability is calculated with function 3, where z is the input vector, zi are the elements of the input vector, and K is the number of classes (Wood, 2023).

σ(z)i = e^(zi) / Σ(j=1..K) e^(zj) (3)

2.6 Overfitting and underfitting

Two major factors affecting the performance of a machine learning model are overfitting and underfitting. The general purpose of machine learning is to build a model that can fit the training data and make correct predictions for data samples outside the training data, not only the training data itself. Overfitting and underfitting both have an impact on the model's ability to make correct predictions for samples outside the training data. Minimizing both factors can be quite challenging since they influence each other (Li;Yan;& Xu, 2021). Overfitting means that a machine learning model starts to learn or memorize random regularities in the training dataset. The model might perform well on the training data but does not perform well on evaluation data, since it is unable to generalize to unseen data. Underfitting is the opposite of overfitting and occurs when the model is incapable of learning features from a training dataset and therefore performs poorly even on the training data (Amazon, 2023).

Picture 13 Graphs visualizing underfitting, a balanced fit, and overfitting of a machine learning model (Amazon, 2023)

Even though overfitting and underfitting are known challenges and are often warned about in books and research written about machine learning, the actual theory related to this topic remains relatively underdeveloped.
Theory related to overfitting and underfitting relies mainly on general knowledge gained in practice instead of any official set of criteria that would determine whether an algorithm will overfit or underfit a given dataset (Bashir;Montanez;Sehra;Segura;& Lauw, 2020).

Minimizing both overfitting and underfitting is hard, and over time many general methods have been proposed in different research to overcome this challenge without changing the general architecture of the model by increasing or reducing the number of neurons in the network (Li;Yan;& Xu, 2021). To understand how to avoid overfitting and underfitting and how machine learning models are optimized in general, this chapter discusses a few adjustable parameters that have an effect on the performance of a model: the optimizer, learning rate, dropout, regularization, and data augmentation.

2.6.1 Learning rate and optimizer

The learning rate is one of the most important hyperparameters and has a significant effect on the model's ability to converge to an optimal solution. Determining the optimal learning rate usually happens with trial and error during the model's training phase. The learning rate is usually related to the optimizer algorithm, meaning that different optimizer algorithms have different optimal learning rates (Li;Yan;& Xu, 2021). Optimizers were discussed in chapters 2.4.10 and 2.4.11.

2.6.2 Dropout

Dropout is a simple method developed to prevent a machine learning model from overfitting. Dropout is, at its simplest, a technique where random neurons or units, including all their connections, are dropped from a neural network during training.
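A minimal sketch of this random dropping is shown below. It uses the common "inverted dropout" scaling convention, which is an implementation detail assumed here rather than taken from the cited sources:

```python
import random

def apply_dropout(activations, rate=0.5, training=True):
    """During training, zero each activation with probability `rate` and scale
    the survivors by 1 / (1 - rate); at inference time, pass values through."""
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)  # fixed seed so the example run is repeatable
print(apply_dropout([0.2, 0.9, 0.5, 0.7], rate=0.5))
```

Because a different random subset of neurons is silenced on every training step, no single neuron can rely on a specific co-activation pattern.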
According to the journal article "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (Srivastava;Hinton;Krizhevsky;Sutskever;& Salakhutdinov, 2014), this method improves the performance of neural networks on supervised learning, explicitly in vision-related tasks, by reducing the network's sensitivity to certain features and improving its generalization ability.

Picture 14 A neural net before and after applying dropout (Srivastava;Hinton;Krizhevsky;Sutskever;& Salakhutdinov, 2014)

Dropout is applied by specifying a dropout rate, which then determines the number of neurons that are dropped out. The optimal rate is found with a trial-and-error method, as are many other hyperparameters in deep learning and machine learning in general (Li;Yan;& Xu, 2021).

2.6.3 Regularization

Another common method for preventing overfitting when training machine learning models is regularization. Regularization is a method where an additional penalty is added to the loss function (chapter 2.4.6), resulting in a sparse parameter matrix that reduces the possibility of overfitting and improves the model's generalization ability (Li;Yan;& Xu, 2021). There are two common regularization methods, L1 and L2 regularization, both of which impose a penalty on the magnitude of the model weights in a similar manner (Kamalov & Leung, 2020). The regularized loss is calculated with function 4, where X is the input value, y the output value, w the weight matrix, Ω(w) the penalty term, and α the rate of regularization.

J̃(w, X, y) = J(w, X, y) + αΩ(w) (4)

L1 regularization penalizes the weights with function 5, where Ω(w) is calculated based on the absolute values of wi, and L2 regularization penalizes the weights with function 6, where Ω(w) is calculated based on the squared values of wi.
Ω(w) = ‖w‖1 = Σi |wi| (5)

Ω(w) = ‖w‖2² = Σi wi² (6)

2.6.4 Data augmentation

Training a well-performing computer vision model with supervised learning requires a sufficient dataset of labelled images. In practice, it is often difficult to obtain enough high-quality images, especially diverse classes of images with labels. Data augmentation is a method that tackles this problem by generating more data from the existing dataset in order to improve the model's performance and avoid overfitting (Yorioka;Kang;& Iwamura, 2020).

Picture 15 Data augmentation achieved with geometric deformation (Yorioka;Kang;& Iwamura, 2020)

There are several different data augmentation methods, which can be divided into six categories: image transformation augmentation, image mixed augmentation, feature space augmentation, semi-supervised augmentation, virtual image generation, and intelligent image augmentation. All the methods belonging to these categories have a certain feature or attribute they modify when applied to a training dataset (Li;Yan;& Xu, 2021). In practice, these methods make copies of an image that is part of a training dataset and modify different parameters such as the brightness, contrast, saturation, size, and angle of the image. A big benefit of data augmentation is that it not only improves the model's performance and helps avoid overfitting but also saves time and costs on data collection and data labelling (Shah, 2023).

2.6.5 Error analysis

A more methodological approach for detecting overfitting and underfitting is to compare and analyse the error generated by the model between a training dataset and a separate test or validation dataset. The difference between these datasets is covered in more detail in the next chapter. In practice, the comparison means that the model is trained on a specific dataset and the errors it produces on that data are compared to errors on unseen data.
When the number of errors in the inferences of a model decreases to a stable level on the training data and the errors on a separate test dataset simultaneously behave in the same manner, this usually implies that the model has generalized the data well without overfitting or underfitting. However, if the errors decrease neither on the training dataset nor on the test dataset, this is most likely due to underfitting, where the model has not been able to generalize all required information from the data. Conversely, if the number of errors continues to decrease on the training dataset but not on the test dataset, it is a sign of overfitting (Li;Yan;& Xu, 2021).

2.7 Data collection

Data collection is one of the most important and critical parts of a computer vision application development process. According to a survey on data collection for machine learning, this has recently become a critical issue, since the amount of labelled data is usually not sufficient for the developed machine learning application, especially in deep learning. One of the most difficult processes in classical machine learning is feature engineering, where the user must comprehend the application and provide features for training the models. Instead of having to manually create features, which is a crucial aspect of data preparation, deep learning can generate these features automatically. But in exchange, deep learning needs a larger labelled dataset (Roh;Heo;& Whang, 2021). Data collection can be thought of as a process consisting of the actual data gathering, data labelling, and possibly data acquisition by data augmentation or discovery (Roh;Heo;& Whang, 2021). Gathering the data can be done either by recording with a camera or by using existing public datasets.
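Data augmentation (chapter 2.6.4) is one way to expand a gathered dataset. A minimal sketch of two simple transformations on an illustrative single-channel pixel matrix:

```python
def flip_horizontal(image):
    """Mirror the image left to right (a simple geometric transformation)."""
    return [list(reversed(row)) for row in image]

def adjust_brightness(image, delta):
    """Shift every pixel value by `delta`, clamped to the 0-255 range."""
    return [[max(0, min(255, p + delta)) for p in row] for row in image]

image = [[10, 50, 90],
         [20, 60, 100]]
# Each transformation yields an additional labelled training sample
print(flip_horizontal(image))        # [[90, 50, 10], [100, 60, 20]]
print(adjust_brightness(image, 40))  # [[50, 90, 130], [60, 100, 140]]
```

In practice, augmentation libraries apply many such transformations with randomized parameters, but the principle of deriving new samples from existing ones is the same.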
Labelling or annotating data is essentially a process where raw data, i.e., images or video, is annotated by specifying the context with labels that specify which data vectors the model must use for training, using software intended for this purpose (IBM, 2023).

2.7.1 Datasets

The collected data is typically divided into training data, validation data, and test data. A training dataset is a dataset consisting of labelled examples of the eventual domain of application, used to train the model to fit the training data, whereas validation data is used to validate the performance of the model (Fisher, et al., 2014). According to Brownlee (2023), the term "validation set" is used interchangeably with the term "test set", both usually referring to data which is held apart from the training and used to optimize the model's performance after each epoch; however, "validation set" is the more common term in this context. Test data can be thought of as a separate set of data typically used to perform the final validation of the model's performance after completing the training, by making sure that the model is able to generalize well to unseen data (Baheti, 2023). The split ratio between training, validation, and test data depends on the number of samples present in the dataset and on the model; there is not really an optimal percentage that would determine how big each dataset should be. A common way to split the data is sixty to eighty percent training data and ten to twenty percent each for validation and testing (Baheti, 2023).

3 Project plan

Studies have shown that sufficient project planning is vital to a project's success rate, especially when working on a software development topic, where projects tend to fail due to insufficient planning (Serrador, 2013). A good project plan contributes toward a better outcome and further helps conduct this master's thesis.
A project plan for the computer vision application intended in this master's thesis is laid out as a software requirements specification, also known as an SRS document, commonly used in software development projects (IEEE, 1984).

3.1 System features and requirements

When the automated guided vehicles (AGVs) start operations in new areas, new issues may arise that were not previously anticipated. Therefore, the requirements for this master's thesis are defined based on the known issues at the time of conducting the study. The performance of the system is constantly evaluated based on the available data, which might bring new insights into the system's performance, and potential areas for improvement may emerge. Hence, the focus of this thesis will be to address the challenges and issues currently known and to develop a solution to improve the AGV system's performance, based on the latest available data and best practices for the target company.

3.1.1 Premises and functional requirements

As explained in the introduction chapter, the current safety scanners in the lower part of the AGVs create a two-dimensional plane around the AGV, leaving a blind spot beneath and above the plane. One of the safety scanners is located at the front end on top of the AGV, creating a two-dimensional plane at a 45-degree angle towards the ground. Whenever an object breaks the laser beam within a specific range of a moving AGV, it causes the AGV to stop if the object is not removed. This means that everything that does not break the laser beam goes undetected. E.g., when the forks of a forklift left on the side of the AGV's path break the safety scanner's laser beam, the AGV slows down and stops. However, the stop does not happen instantly, meaning that by then the beam has moved forward enough for the forks not to break it anymore. Thereby, the AGV no longer detects anything in its path and continues driving forward, eventually crashing into the forks.
This applies to all objects that break the laser beam only for a moment due to their shape or placement. Another common issue is that the AGVs drive too close to operating manual forklifts before stopping, leaving little to no clearance for the manual forklift to turn and move out of the way. This could be solved by having remote controllers, which can be used to stop the AGVs, in all the manual forklifts operating in the target company's facilities. A computer vision application that can identify manual forklifts enables the implementation of an automatic system that stops the AGVs earlier when a manual forklift is detected.

Based on these premises, the minimum requirement for the application is an object detection model that detects objects in the AGV's path which the AGV is not able to detect with its current laser scanners. In addition, all manual forklifts should be identified, for the purpose of leaving more clearance between the trucks. For the application to be functional, it must distinguish between built-in structures and movable objects. E.g., when the AGVs navigate between pallet racks in the warehouse, the racks should be identified as built-in structures and therefore ignored. However, since an object detection model only detects objects it has been trained to detect, this requirement is fulfilled by design.

The objects which the model is trained to detect and classify can be determined with the flowchart presented in figure 2. If the current system does not recognize an object and there is an imminent risk that the AGV encounters it and possibly crashes into it when operating in the warehouse or production area, the computer vision application should recognize the object and signal the AGV to stop. If the current scanners do not recognize an object but there is no risk that the AGVs would encounter it, the object should be ignored, as should objects that the current system can already detect with its laser scanners.
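The decision logic of the flowchart in figure 2 can be condensed into a small predicate (an illustrative sketch; the thesis expresses this as a flowchart rather than code):

```python
def needs_detection(detected_by_scanners: bool, imminent_risk: bool) -> bool:
    """Flowchart logic (figure 2): an object belongs in the detection scope
    only if the current laser scanners miss it AND there is an imminent risk
    of the AGV crashing into it; everything else is ignored."""
    return (not detected_by_scanners) and imminent_risk

# The forks of a forklift: missed by the scanners, imminent crash risk.
print(needs_detection(detected_by_scanners=False, imminent_risk=True))   # True
# A built-in structure with no encounter risk, e.g., a pallet rack.
print(needs_detection(detected_by_scanners=False, imminent_risk=False))  # False
```

Every object in table 2 falls into the first case, which is why all of them end up in the training scope.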
Figure 2 Process flowchart to determine detectable objects

A list of objects that the computer vision application needs to recognize can be seen in table 2. It was acquired by collecting a comprehensive list of objects that the AGV could encounter when operating in the target company's facilities and processing each object with the process flowchart presented in figure 2. The complete list of objects is added to appendix 1.

Object                     Imminent risk   Trained
Pallet Jack                Yes             Yes
Rider Truck                Yes             Yes
Reach Truck                Yes             Yes
Counterbalance forklift    Yes             Yes

Table 2 List of objects that the model is trained to detect and classify.

The application should signal the AGV to stop when an object is detected in the AGV's path within a certain distance of the AGV. The transmitted signal would consist of a command for the AGV to stop and possibly information about whether the object was a manual forklift. As a real-time application, the delay between detecting an object and sending the signal should be minimal. The top speed of an AGV is approximately 1,2 m/s in the warehouse and factory area and 2,0 m/s in the hubway connecting the warehouse and factory. The camera module's field of vision is about 4,00 meters. In order to determine the distance at which the AGV needs to start braking, the braking distance of the AGVs was measured experimentally for different speeds. The results are visible in figures 3 and 4, where the braking distance is plotted against the speed when using the "soft stop" and emergency brake functionalities and compared to the field of view.

Figure 3 Braking distance-speed graph for soft stop
[Figure: braking distance (m) and field of view (m) plotted against speed (m/s) at 1,20 and 2,00 m/s.]

Figure 4 Braking distance-speed graph for emergency stop

Soft stop is a functionality whose purpose is to disengage an operating or stationary AGV from the system until it is re-engaged.
The AGVs have one soft stop button located on each side. An emergency stop is essentially the same as a soft stop but stops a moving AGV much faster and requires an additional reset of error messages once pressed. There are altogether four emergency stop buttons on each AGV. When using the soft stop functionality, the braking distance is between 1,80 m and 5,00 m, depending on the speed. The AGV's braking distance is 0,6 m to 1,7 m when using the emergency brake functionality.

When using the soft stop functionality, the braking distance exceeds the field of view when the speed of the AGV exceeds approximately 1,5 m/s, even with no delay in the system. This implies that the AGVs need to brake harder when the speed exceeds about 1,5 m/s. Based on the tests, carrying a load does not significantly affect the braking distance when using the soft stop functionality. Therefore, stopping the AGV with the soft stop functionality is adequate when operating in the warehouse and factory area. Based on the tests and calculations, the minimum requirement is that the system can detect an object and send the signal within the time t from when the object appears in the field of view. This is calculated with equation 7, where F is the field of view, v the speed, and τ the delay of the system.

t = (F - ((4v - 3) + vτ)) / v,  (0 < F ≤ 3, 0 ≤ v ≤ 1,5 and 0 ≤ τ ≤ t)   (7)

Since the braking distance exceeds the field of view at speeds above 1,5 m/s, a functionality for activating emergency braking is required for the AGV to stop before crashing. However, since the AGV exceeds the speed of 1,5 m/s only in the hubway, which is only used for traveling between the warehouse and factory with forklifts, the probability of encountering unexpected obstacles there is relatively low.
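The linear terms inside equations 7 and 8, (4v - 3) and (1,375v - 1,05), are fits of the measured braking distances. As a sketch, the fits and the resulting detection-time budget of equation 7 can be checked numerically (function and variable names are illustrative):

```python
def soft_stop_distance(v):
    """Soft stop braking distance fit (m); the 4v - 3 term of equation 7."""
    return 4 * v - 3

def emergency_stop_distance(v):
    """Emergency brake distance fit (m); the 1.375v - 1.05 term of equation 8."""
    return 1.375 * v - 1.05

def detection_time_budget(F, v, tau):
    """Equation 7: time available to detect an object and send the stop
    signal, for field of view F (m), speed v (m/s), and system delay tau (s)."""
    assert 0 < F <= 3 and 0 < v <= 1.5 and tau >= 0
    return (F - (soft_stop_distance(v) + v * tau)) / v

# The fits reproduce the measured ranges: 1,80-5,00 m (soft stop) and
# 0,6-1,7 m (emergency brake) at the two operating speeds 1,2 and 2,0 m/s.
for v in (1.2, 2.0):
    print(round(soft_stop_distance(v), 2), round(emergency_stop_distance(v), 2))
```

At 1,2 m/s with a 3 m usable field of view and no system delay, the budget t is roughly one second, which sets the real-time requirement for the detector.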
In addition, determining the AGV's speed accurately enough to calculate a satisfactory threshold for activating the emergency braking would require additional external equipment. For these reasons, having the computer vision system activated in the hubway is not feasible at this time. Since the braking distance is 0,6 m when using the emergency brake, all obstacles that appear in the AGV's field of view within 1,8 m of the AGV can be avoided by activating the emergency brake. The activation should happen within the time t2 from when the object appears in the field of view. This is calculated with equation 8, where F is the field of view, v the speed, and τ the delay of the system.

t2 = (F - ((1,375v - 1,05) + vτ)) / v   (8)

3.1.2 Non-functional requirements

The ability to identify all possible objects that could appear in the AGV's path is not implemented in this master's thesis due to the immense amount of data required to train a model for that purpose. The requirement is that the object detection model can detect all objects that are not recognized by the current safety scanners and that would cause an imminent safety risk if crashed into. The current safety scanners are used as the primary system, and the application developed in this master's thesis works as a backup. I.e., if the safety scanners detect something in the AGV's path, the AGV will stop even though the computer vision application does not detect anything, and vice versa. This way, the detection of obstacles in the AGV's path does not depend on only one system.

To meet the target company's data security requirements and the General Data Protection Regulation set by the EU, none of the video material filmed by the computer vision application is saved on the device itself or anywhere else. The device is not connected to any network and cannot be accessed with a remote connection, which ensures the data security of the device.
However, the benefits of saving the video footage are discussed but not implemented, since the prototype of the computer vision application developed for this master's thesis is still in the concept testing phase.

3.1.3 External interface requirements

The requirement for the object detection model is that it can be implemented on an advanced AI-embedded system to meet the requirements of a real-time application. One possible solution for this could be Nvidia's Jetson embedded platform, which is one of the top platforms used for autonomous machines and other embedded applications. The Jetson Nano developer kit is a powerful embedded platform capable of running multiple neural networks in parallel for a computer vision application with relatively low power consumption (Nvidia, 2022).

Sending the signal to stop the AGV could be implemented by replicating the remote controller which is used to soft stop the AGVs. This would essentially be an electronic RF transmitter module connected to the embedded platform. Electricity for powering the system would be provided with a power supply connected to the AGV. The requirement for an embedded system of this nature is 5 V and 2 A, or 5 V and 4 A if performing high-performance computational tasks (Ximea, 2022). The device should be both compact and easily maintainable, while also enabling development, i.e., it should feature components that are easily replaceable. The device comprises multiple components, including a casing, camera module, embedded platform, RF transmitter module, and a fan for cooling. To facilitate efficient production, the physical casing for the camera module and embedded system may be manufactured using the target company's additive manufacturing laboratory.

3.1.4 Quality attributes

Since this application is a real-time safety system, it should be able to detect all objects in its path with close to one hundred percent certainty.
Therefore, the current safety scanners work as the primary system and the computer vision application as the backup. Utilizing computer vision as part of the AGV's navigation system should improve safety without affecting the system's overall performance. In practice, the system should not cause unnecessary stops by falsely detecting, e.g., traces on the floor as objects, since this would extend the AGV's pallet delivery time and decrease the AGV system's overall performance. I.e., the number of false predictions should be as low as possible.

4 Implementation

This chapter presents the methods and tools used to carry out the practical part of this master's thesis, namely the training of the object detection model required for the computer vision application, based on the requirements specification defined in chapter 3. The methods and tools were mainly chosen based on a prior computer vision application project implemented earlier on the target company's premises. Other factors affecting these decisions were prior experience and available documentation. A short overview of YOLO in general and YOLOv7 in particular is presented so that the reader is able to understand the key factors and metrics used to evaluate the results.

The tools utilized in this section were open-source software, including Label Studio and JupyterHub. Label Studio is a web-based platform that facilitates the efficient annotation of large datasets for machine learning applications. JupyterHub, on the other hand, provides an interactive computing environment that enables the user to create and share code in a web browser. The object detection algorithm employed in this study is YOLOv7. The selection of these tools was based on their proven effectiveness in similar studies, as well as their accessibility and compatibility with the research objectives.
Figure 5 Development process

4.1 YOLOv7

YOLO is a convolutional neural network that has evolved from its initial version to the most recent version 7, published in 2022. The latest version has several enhancements but shares key concepts with the earlier versions. YOLOv7 is a real-time object detector that, according to Wang et al. (2022), surpasses all known object detectors in both speed and accuracy. YOLOv7 was selected for this master's thesis since it is a real-time object detection algorithm capable of detecting moving objects with high accuracy, and it is scalable, which makes it suitable for, e.g., embedded systems. In addition, YOLOv7 has delivered promising results in another object detection project implemented on the target company's premises, where it was used to determine whether a pallet was full or empty (Sormunen, 2023).

4.1.1 Architecture

The general YOLO architecture is composed of a backbone, a neck, and a head. In short, the backbone's main task is to extract essential features from the data and feed them forward to the head via the neck. The neck collects feature maps from the extracted features and creates feature pyramids of them. The head consists of output layers that make the final detections (Chuyi et al., 2022).

Figure 6 YOLOv3 architecture with introduction to PP-YOLO features (Xiang et al., 2020)

YOLOv7 introduces several architectural improvements that increase both efficiency and accuracy. One of the major architectural changes presented in the YOLOv7 paper is the introduction of a new computational block, E-ELAN (Extended Efficient Layer Aggregation Network), in the YOLOv7 backbone, which utilizes expand, shuffle, and merge cardinality in order to improve the learning ability of the network without destroying the original gradient path (Wang, Bochkovskiy & Liao, 2022).
Figure 7 Extended efficient layer aggregation network (Wang, Bochkovskiy & Liao, 2022)

Another major improvement introduced with YOLOv7 is model scaling. Model scaling allows scaling an already designed model to fit different types of computing devices by adjusting scaling factors, such as resolution (size of the input image), depth (number of layers), width (number of channels), and stage (number of feature pyramids), in order to achieve a reasonable trade-off between the number of network parameters, computation, inference speed, and accuracy (Wang, Bochkovskiy & Liao, 2022). This is a noteworthy feature considering the intended end use of the object detection model trained in this master's thesis.

Bag of freebies is a set of techniques or methods that change the training strategy or training cost in order to improve the accuracy of the model. Bag of freebies was initially introduced in the YOLOv4 paper and can be viewed as a broad framework of training methods for improving an object detection model's overall accuracy. These methods include activations, bounding box regression loss, data augmentation, and regularization methods (Chien-Yao, Hong-Yuan & Bochkovskiy, 2020).

4.1.2 Loss function

The loss function used in YOLOv7 is composed of three parts: bounding box loss, objectness loss, and classification loss (Wang, Bochkovskiy & Liao, 2022). All three are modulated by some scalar parameter or an IoU score between the model's prediction and a ground truth. Bounding box loss measures the intersection over union between the predicted and target bounding box, calculated with equation 9, where B^gt is the ground-truth box and B is the predicted box (Zhaohui et al., 2020).

IoU = |B ∩ B^gt| / |B ∪ B^gt|   (9)

Objectness loss measures the objectness, which is essentially the probability that an object exists in a proposed region of interest. I.e., when the objectness is high, the image window contains an object with high probability.
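Equation 9 translates directly into code for axis-aligned bounding boxes; a minimal sketch (illustrative, not taken from the YOLOv7 implementation):

```python
def iou(box_a, box_b):
    """Intersection over union (equation 9) for two axis-aligned boxes
    given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle; zero if the boxes do not intersect.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 corner: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 0.14285714285714285
```

A perfect prediction yields IoU = 1, disjoint boxes yield IoU = 0, and the bounding box loss rewards predictions accordingly.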
Classification loss, in turn, measures how well the model can predict the correct class of a given object (Kasper-Eulaers et al., 2021). In earlier versions, a loss function called focal loss has been experimented with in order to avoid overfitting. Focal loss is a function that concentrates on the examples where the model fails rather than the ones it can confidently predict. Focal loss makes sure that predictions on challenging examples improve over time rather than the model getting too confident with simple ones (Chuyi et al., 2022).

4.1.3 Metrics used with YOLOv7

Other metrics that are commonly used to evaluate an object detection model's performance are recall, precision, F1 score, and mean average precision. These are all metrics that YOLOv7 reports as results after a training run. Recall measures how well the model predicts all existing objects and is calculated with equation 10, where TP is true positives, FP false positives, and FN false negatives. Precision, in turn, measures how accurate the predictions are and is calculated with equation 11 (Sazanita Isa, Rosli, Yusof, Maruzuki & Sulaiman, 2022).

Recall = TP / (TP + FN)   (10)

Precision = TP / (TP + FP)   (11)

The F1 score is essentially a metric that combines precision and recall, measuring the model's accuracy. The F1 score is calculated as the harmonic mean of precision and recall with equation 12 (Lipton, Elkan & Naryanaswamy, 2014).

F1 Score = 2TP / (2TP + FP + FN)   (12)

Mean average precision (mAP) is calculated by finding the mean of the average precision (AP) over all classes with equation 13, where N is the number of classes and AP_i the average precision of class i (Sazanita Isa, Rosli, Yusof, Maruzuki & Sulaiman, 2022).

mAP = (1/N) Σ_{i=1}^{N} AP_i   (13)

4.1.4 Data augmentation

Mosaic data augmentation was introduced in YOLOv4 and is also present in the latest version.
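Equations 10-13 are simple enough to express directly as plain functions; in this sketch, the TP, FP, and FN counts and the per-class AP values would come from an evaluation run:

```python
def recall(tp, fn):
    """Equation 10: share of actual objects that were detected."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Equation 11: share of detections that were correct."""
    return tp / (tp + fp)

def f1_score(tp, fp, fn):
    """Equation 12: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

def mean_average_precision(ap_per_class):
    """Equation 13: mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)

# Example: 90 correct detections, 10 false alarms, 20 missed objects.
print(precision(90, 10))     # 0.9
print(recall(90, 20))        # ~0.82
print(f1_score(90, 10, 20))  # ~0.86
```

Note that a high precision with a low recall (or vice versa) still yields a mediocre F1 score, which is why the models in chapter 4.5 are compared on all of these metrics together.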
Mosaic data augmentation is a method where four training images are mixed, which allows the model to detect objects outside their normal context (Chien-Yao, Hong-Yuan & Bochkovskiy, 2020).

Picture 16 Example of a training batch with mosaic data augmentation used to train the model

Additionally, YOLOv7 uses a greater resolution than previous versions by processing images at a resolution of 640 by 640 pixels. This improves the model's capability to detect smaller objects (Wang, Bochkovskiy & Liao, 2022).

4.2 Data collection

The data used for training the YOLOv7 object detection model was collected by recording the foreground of an operating AGV with a camera attached to the front end of the AGV. The camera was attached approximately to the same spot where the final embedded device would be placed. The majority of the data was collected by filming the AGV's daily operations in the target company's facilities and by cutting out clips with relevant information about the objects defined in chapter 3.1.1 from the video data. The final scope of objects selected to be part of the training set is presented in table 2. The goal is to eventually gather a comprehensive and expressive enough dataset so that the model is able to generalize the different objects in various surroundings. In order to achieve this, a set of videos was recorded by intentionally placing objects of each class on the AGV's path so that they would be visible to the camera. The camera used in the data collection was a Waltter 4K action camera, and the videos were recorded at 30 FPS with a resolution of 1080x1920 pixels. A total of 26 separate videos containing 131 minutes of video data was recorded during the data collection process, of which 6,7 minutes was used as part of the final training, validation, and test datasets.

4.3 Data preparation

Label Studio, which was used to annotate the data, only allows exporting the annotated video data in JSON format, which is not compatible with YOLOv7.
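YOLOv7 expects one text file per image with one "class x_center y_center width height" row per object, all coordinates normalized to 0..1, so the exported JSON must be converted. A minimal sketch of the core of such a conversion (Label Studio exports percent-based coordinates; the label names match table 2, but the exact export fields should be treated as assumptions):

```python
# Class order defines the YOLO class indices; illustrative, taken from table 2.
LABELS = ["Pallet-Jack", "Rider-truck", "Reach-Truck", "Counterbalance-truck"]

def to_yolo_row(label, x_pct, y_pct, w_pct, h_pct):
    """Convert one annotation from percent coordinates (top-left corner,
    width, height) to a YOLO row: class index plus normalized box
    centre and size, each in 0..1."""
    cls = LABELS.index(label)
    xc = (x_pct + w_pct / 2) / 100  # centre x, normalized
    yc = (y_pct + h_pct / 2) / 100  # centre y, normalized
    return f"{cls} {xc:.6f} {yc:.6f} {w_pct / 100:.6f} {h_pct / 100:.6f}"

# Hypothetical annotation: a rider-truck box 25 % from the left, 40 % from
# the top, covering 10 % x 20 % of the frame.
print(to_yolo_row("Rider-truck", 25.0, 40.0, 10.0, 20.0))
# 1 0.300000 0.500000 0.100000 0.200000
```

The full script would iterate over the exported JSON, write one such .txt file per frame, and extract the matching frame image with the same index.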
Therefore, an extra step was required to convert the data into a YOLOv7-compatible format using a Python script that extracts the information from the JSON file and creates a .txt file containing the data for each annotation. YOLOv7 consumes the annotation data from a text file containing information about the class and the annotation box position and size in the respective annotated picture. In addition, each frame of the video data had to be extracted and saved as an individual image with the same index as the respective text file containing the extracted frame's annotation data.

The labels and respective images were split into a training, validation, and test set using a Python script. The training set contained approximately 70 percent of the data and the validation and test sets 15 percent each. To further minimize the potential for overfitting, each set contained data from separate videos. The validation data was only used to monitor the model's performance, and having unseen data as validation data gives more reliable results. The total number of annotated images by class in each set is visible in tables 3, 4, and 5.

4.4 Label Studio

The collected data was annotated using Label Studio, an open-source data labelling platform intended for image classification, object detection, and semantic segmentation. Label Studio can be hosted locally, but the data was labelled using an instance hosted by the target company. Label Studio's free version has a simple and easy-to-use user interface with the basic features for image and video annotation, which was enough for the purposes of this master's thesis.

Picture 17 Label Studio user interface

Label Studio has a feature which allows linear interpolation between frames in a video, i.e., the annotations were done on videos instead of individual images. This sped up the annotation process, since a linearly moving object needs to be annotated in only one frame at regular intervals.
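The idea behind this keyframe-based annotation can be illustrated with a small sketch: boxes are stored only at keyframes, and the frames in between are filled in by linear interpolation (illustrative code, not Label Studio's implementation):

```python
def interpolate_box(kf_a, kf_b, frame):
    """Linearly interpolate a bounding box between two keyframes.
    Each keyframe is (frame_index, (x, y, w, h))."""
    fa, box_a = kf_a
    fb, box_b = kf_b
    t = (frame - fa) / (fb - fa)  # 0.0 at kf_a, 1.0 at kf_b
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

# Keyframes 10 frames apart, as used in this thesis; the box drifts right.
start = (0, (100.0, 50.0, 40.0, 30.0))
end = (10, (140.0, 50.0, 40.0, 30.0))
print(interpolate_box(start, end, 5))  # (120.0, 50.0, 40.0, 30.0)
```

For an object moving at constant speed across the image, one annotation every ten frames is therefore enough to label all thirty frames of a second of video.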
Each of these annotations works as a keyframe, which is used by the Label Studio software to interpolate between the keyframes (Heartex, Inc., 2023). All the data used in this master's thesis was annotated using this feature with an interval of 10 frames. A general good practice when annotating is to include the target object from corner to corner inside the annotation box. This should be applied consistently to all annotations present in the dataset. In this master's thesis, there was only one person annotating the data, which ensured relatively consistent annotation practices, which on its own usually results in higher quality labelling data (Liao, Kar & Fidler, 2021).

Picture 18 Example of a linear motion labelling sequence in Label Studio

4.5 Model training

When the dataset was ready to be used for training the YOLOv7 object detection model, the next step was to set up a suitable training environment. Training an object detection model such as YOLOv7 requires an environment with the proper libraries installed. The model was trained using a Jupyter notebook, an interactive computing platform, created on a JupyterHub server, which is essentially a web-based notebook development environment that runs in the cloud (Jupyter, 2023). This section covers all the steps required to set up the training environment and the actual training of the model.

4.5.1 Training preferences

The latest version of YOLOv7 was downloaded to the Jupyter notebook from YOLOv7's GitHub repository, and all the required libraries were installed using the "requirements.txt" file in the YOLOv7 repository. The notebook was created on a JupyterHub server with an NVIDIA Tesla T4 GPU available for training. All Python commands used to set up the environment are visible in attachment 2. The goal was to evaluate how well YOLOv7 performs on the given dataset based on a few general evaluation metrics, namely the classification, objectness, and accuracy of the model.
The performance was tested by experimenting in a trial-and-error fashion with different parameters for each run, with the objective of improving the model's capability to recognize each class and its general accuracy. The parameters that were adjusted between the runs were the batch size, the number of epochs, the dataset, and the hyperparameters initial learning rate, final learning rate, and mosaic augmentation probability. During this phase the model was trained a total of ten times with three different datasets. The total training time added up to ca. 100 hours. All datasets and training runs with their respective parameters are shown in tables 3, 4, 5, and 6. All other hyperparameters that were not altered during this phase are shown in attachment 3.

Class          Number of labelled images
               Train    Validation    Test
Rider-truck    3102     1084          602
Total:         3102     1084          602

Table 3 First dataset used for training.

Class                   Number of labelled images
                        Train    Validation    Test
Counterbalance-truck    3550     788           1092
Pallet-Jack             2945     856           822
Rider-truck             3102     1084          602
Reach-Truck             3010     1208          645
Total:                  15098    3936          3161

Table 4 Second dataset used for training.

Class                   Number of labelled images
                        Train    Validation    Test
Counterbalance-truck    3550     788           1092
Pallet-Jack             2945     856           822
Rider-truck             3102     1084          602
Reach-Truck             3500     1208          645
Total:                  15608    3936          3161

Table 5 Final dataset used for training.

4.5.2 Model optimization

The initial training run was intended for evaluating how the model behaves on the given dataset and how long the training takes. The first dataset contained labelled images of only one class, the rider-truck. After training with the first dataset and default hyperparameters for 10 epochs and a batch size of 9, the results confirmed that the model was behaving as expected, and thereby the number of epochs and the batch size were increased for the second run with the same dataset.
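A training run of this kind is typically launched with the YOLOv7 repository's train.py script, roughly as below; the dataset and hyperparameter file names are illustrative, while the numeric values follow run 3 in table 6:

```shell
# Sketch of a YOLOv7 training invocation (file names are assumptions).
python train.py --weights yolov7.pt --cfg cfg/training/yolov7.yaml \
    --data data/agv.yaml --hyp data/hyp.agv.yaml \
    --epochs 30 --batch-size 16 --img-size 640 640 \
    --device 0 --workers 8 --name run3
```

Between runs, only the values of these flags and the hyperparameter file changed, which is what table 6 summarizes.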
         Epochs   Batch size   Initial LR   Final LR   Mosaic   Mix-up   Scale   Paste-in
Run 1    10       9            0.01         0.1        1.0      0.15     0.9     0.15
Run 2    55       16           0.01         0.1        1.0      0.15     0.9     0.15
Run 3    30       16           0.01         0.1        1.0      0.15     0.9     0.15
Run 4    60       16           0.01         0.1        1.0      0.15     0.9     0.15
Run 5    30       9            0.01         0.1        1.0      0.15     0.9     0.15
Run 6    30       16           0.01         0.1        1.0      0.15     0.9     0.15
Run 7    30       16           0.01         0.1        0.5      0.15     0.9     0.15
Run 8    30       16           0.001        0.01       0.5      0.15     0.9     0.15
Run 9    30       16           0.001        0.01       1.0      0.15     0.9     0.15
Run 10   30       16           0.001        0.01       0.5      0.1      0.7     0.05

Table 6 All training runs with their respective parameters.

The objectness loss of the model decreased steadily, and the precision, recall, and mean average precision increased over time, which indicates that the model can generalize the features of the one class present in the dataset. Based on these results, the dataset was expanded to contain labelled images of four different classes. The results of the second training run are visible in figure 8.

Figure 8 Results of the second training run.

In the third run, trained with the new dataset containing four classes, the number of epochs was set to 30 and the batch size to 16; all other hyperparameters were left at their defaults. The results of the third run on the larger dataset showed some instability based on the precision and recall; however, the model seemed to converge when looking at the classification loss, despite the relatively low number of epochs. The same model was further trained for another 30 epochs with the same parameters, which did not have any significant impact on the results. However, the F1 score of the model gave a lower confidence for the class "reach-truck". Based on this information, the number of labelled images in the training dataset as well as the validation dataset was increased for this class. The fourth training run determined the average training time for one epoch, which was approximately 24 minutes. The training was carried out overnight, which means that it could not last longer than 12 hours.
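The epoch limit follows directly from these two numbers:

```python
# Maximum whole epochs that fit in an overnight (12 h) training window,
# given the measured ~24 minutes per epoch.
minutes_per_epoch = 24
max_epochs = (12 * 60) // minutes_per_epoch
print(max_epochs)  # 30
```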
Thereby, the upper limit for epochs per training run was set to 30 for the rest of the training runs, since at approximately 24 minutes per epoch this is the point where the training time reaches 12 hours.

Figure 9 F1 curve after the fourth training run.

The results of training runs five and six showed some improvement in the confidence for the class "reach-truck" based on the F1 score. However, the confusion matrix of the model shows that the model falsely predicts background as reach-truck, which might be caused by incorrect labelling. This was checked by manually going through the dataset frame by frame in Label Studio, which did not reveal any flaws in the dataset. According to Pham et al. (2022), unrealistic effects which might cause false predictions can be avoided by reducing the hyperparameters scale, mosaic, mix-up, and paste-in. The effect mosaic augmentation has on the results depends on the complexity of the target objects and background (Wang & Song, 2020). Reach-trucks have more complex features compared to the other classes and would thereby be prone to errors of this nature.

Figure 10 Confusion matrix of training run six.

Lowering the mosaic hyperparameter probability from 1.0 to 0.5 in the seventh training run improved the model's performance in detecting reach-trucks correctly. The model did, however, show some instability at the start of the training run, which is not optimal when working with a limited number of epochs per training. By lowering the learning rate, it is possible to reduce this oscillation effect (Sazanita Isa, Rosli, Yusof, Maruzuki & Sulaiman, 2022).

Figure 11 Results of the eighth training run with lower learning rates and mosaic parameter.

Lowering the initial learning rate to 0.001 and the final learning rate to 0.01 had a significant effect on the model's behaviour. The model converged much faster than in the previous runs when looking at the training loss and mean average precision.
The ninth training run was intended to experiment with the initial probability of mosaic augmentation combined with a lower learning rate. The initial and final learning rates were set to the same values as in the previous run. The results from this run were in line with the previous results. The model stabilized and converged relatively fast, but some confusion in detecting reach-trucks was apparent in the confusion matrix.

In the last training run, the data augmentation hyperparameters mix-up, scale, and paste-in were lowered to see how this affects the model's performance. Lowering the paste-in hyperparameter reduces the training time according to the YOLOv7 documentation (Wang, Bochkovskiy & Liao, 2022). All hyperparameters of the last training run are visible in attachment 3. The learning rates were the same as in the two previous runs, and the mosaic parameter was set to 0.5. The results of the last run did not deviate significantly from the previous runs. The model did show some fluctuation at the beginning of the training run for precision and recall as well as for both mean average precision metrics. Overall, the accuracy of the model was at the same level as in the previous runs.

The performance of all models trained in this master's thesis is presented in table 7, where the precision, recall, and mAP@.5 of each model are visible. All other figures related to the results of the training runs mentioned in this chapter are shown in attachment 4.

            Precision   Recall   mAP@.5
Model 1     0.956       0.826    0.87
Model 2     0.974       0.902    0.968
Model 3     0.9127      0.8899   0.8881
Model 4     0.9316      0.8939   0.9002
Model 5     0.9267      0.9174   0.9197
Model 6     0.9228      0.9099   0.9346
Model 7     0.9353      0.8808   0.9133
Model 8     0.9074      0.9117   0.9335
Model 9     0.9232      0.9192   0.9338
Model 10    0.9193      0.9139   0.9393

Table 7 Overall performance of each model.

4.5.3 Test dataset

The models trained thus far have only seen data from the training dataset and the validation dataset.
In order to get more representative results of the models' performance on unseen data, the models are tested on the test dataset. Models 1 and 2 are left out, since they were trained on only one class. Models 3 and 4 had a smaller dataset than the rest and are therefore also left out. Based on the outcome of this test, the best-performing model is selected to be further evaluated in a simulated environment with situations that the AGVs most likely encounter when operating in a warehouse environment.

The evaluation of the models' performance is done based on the precision, recall, mAP@.5, and F1 score by class. All models are tested with the same batch size of 16 and the last weights each model had at the end of its training run. The results are shown in table 8, where the highest value in each metric is marked dark green, the second highest light green, the third highest yellow, and the lowest red.

The best-performing model on the test dataset is the tenth model, with the best overall score compared to the other models. The second-best performing model was the eighth model, which performed better in recall and precision than the tenth but had lower scores in mean average precision and F1 score. The worst-performing model was the ninth, with the lowest values in both precision and mean average precision. Having trained the models for only 30 epochs might not show the full potential of all models, since some might require a longer training time to fully converge due to the different learning rates.

Figure 12 The tenth model's performance on validation and test dataset and average performance of all models.
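As a sketch of this comparison, the overall ("All") rows of table 8 can be ranked by a simple unweighted average of the four metrics; the thesis does not specify an exact weighting, so the averaging below is an illustrative assumption:

```python
# Overall ("All") test-set metrics from table 8: precision, recall, mAP@.5, F1.
results = {
    "Model 5":  (0.912, 0.871, 0.901, 0.805),
    "Model 6":  (0.907, 0.901, 0.901, 0.797),
    "Model 7":  (0.916, 0.876, 0.917, 0.782),
    "Model 8":  (0.925, 0.903, 0.907, 0.807),
    "Model 9":  (0.881, 0.898, 0.886, 0.784),
    "Model 10": (0.912, 0.902, 0.923, 0.827),
}

def overall_score(metrics):
    """Unweighted mean of precision, recall, mAP@.5, and F1."""
    return sum(metrics) / len(metrics)

best = max(results, key=lambda name: overall_score(results[name]))
print(best)  # Model 10
```

Even this naive averaging reproduces the ordering discussed above: the tenth model comes out on top and the ninth last.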
[Figure 12 data, as precision / recall / mAP@.5: Validation 0,9193 / 0,9139 / 0,9393; Test 0,912 / 0,902 / 0,923; average of all models 0,9309 / 0,89647 / 0,92005.]

Class                  Precision  Recall  mAP@.5  F1
Model 5
  All                  0.912      0.871   0.901   0.805
  Rider-truck          0.918      0.987   0.979   0.927
  Pallet-Jack          0.935      0.875   0.875   0.779
  Reach-truck          0.858      0.963   0.947   0.86
  Counterbalance-truck 0.938      0.661   0.804   0.654
Model 6
  All                  0.907      0.901   0.901   0.797
  Rider-truck          0.923      0.971   0.971   0.913
  Pallet-Jack          0.934      0.875   0.873   0.757
  Reach-truck          0.829      0.982   0.936   0.828
  Counterbalance-truck 0.94       0.777   0.826   0.69
Model 7
  All                  0.916      0.876   0.917   0.782
  Rider-truck          0.95       0.947   0.97    0.875
  Pallet-Jack          0.836      0.875   0.857   0.69
  Reach-truck          0.915      0.94    0.949   0.843
  Counterbalance-truck 0.962      0.74    0.893   0.719
Model 8
  All                  0.925      0.903   0.907   0.807
  Rider-truck          0.936      0.989   0.976   0.903
  Pallet-Jack          0.95       0.859   0.874   0.791
  Reach-truck          0.844      0.963   0.936   0.846
  Counterbalance-truck 0.968      0.8     0.844   0.688
Model 9
  All                  0.881      0.898   0.886   0.784
  Rider-truck          0.916      0.977   0.965   0.898
  Pallet-Jack          0.933      0.856   0.84    0.761
  Reach-truck          0.759      0.982   0.924   0.829
  Counterbalance-truck 0.918      0.777   0.816   0.648
Model 10
  All                  0.912      0.902   0.923   0.827
  Rider-truck          0.931      0.974   0.979   0.92
  Pallet-Jack          0.935      0.875   0.868   0.793
  Reach-truck          0.852      0.972   0.973   0.881
  Counterbalance-truck 0.93       0.786   0.874   0.716

Table 8 Test results with the highest scores highlighted with respect to the metric.

Algorithm: YOLOv7 (version from github.com/WongKinYiu/yolov7, January 2023).
Framework/Library: YOLOR v0.1-121-g2fdc7f1.
Development platform: JupyterHub. All development, training, validation, and testing was carried out on a target-company-hosted instance of a JupyterHub notebook server.
CUDA: Tesla T4, 15109.75 MB. GPU resource used to train the model.
Dependencies: See appendix 6. All dependencies are listed in appendix 6.
Hyperparameters: See appendix 5 or chapter 4.5.2. Adjusted experimentally; the best performing model's hyperparameters are available in appendix 5.
Optimization: Experimental. Optimization was carried out by adjusting hyperparameters based on model performance (objectness, classification, precision, and recall).
Validation: Test dataset. The model performance was validated on an unseen test dataset. See chapters 4.5.3 and 5.3.
General model architecture: Default. The general architecture of the model (number of layers, hidden layers, etc.) was left as default for the version of YOLOv7 used. See github.com/WongKinYiu/yolov7.
Layers: 415. Number of layers in the model used for training.
Gradients: 37 212 738. Number of gradients in the model used for training.
Computational complexity: 105.5 GFLOPS. Computational complexity of the model used for training.
Optimizer: Stochastic gradient descent (SGD). Optimizer used for training.
Regularization: Scaled weight decay = 0.0005. Scaled weight decay was not adjusted during training.
Workers: 8. Number of workers used during training.
Epochs: See table 6 in chapter 4.5.2. Number of epochs was limited to 30 for all training runs.
Batch size: See table 6 in chapter 4.5.2. Batch size was limited to 16 due to lack of computational resources.
Classes: 4. Number of classes explained in chapter 3.1.1.
Image size: 640x640. Default input image size when using CUDA.
Training data: 15608 images (65%). Consists of images from multiple videos of the objects in the classification scope.
Validation data: 4703 images (19,8%). Consists of images from multiple videos of the objects in the classification scope. Completely different footage than in the training dataset.
Test data: 3439 images (14,5%). Consists of images from multiple videos of the objects in the classification scope. Completely different footage than in the training and validation datasets.
Annotation tool: Label Studio v1.7.0. Used to annotate all data used for training, validation, and testing.
Annotation format: PyTorch TXT. Converted from JSON format to PyTorch TXT using a Python script and saved in the respective files and folders according to YOLOv7 requirements. See chapter 4.3. All scripts available on request.
Image resolution: 1080x1920 pixels. All footage used for training was recorded using a Waltter 4K action camera at 30 FPS with a resolution of 1080x1920 pixels.
Model: Trained model weights available on request.

Table 9 Summary table of the trained model.

5 Testing

To get a better understanding of the trained model's performance, proper testing is essential. Object detection model testing differs from traditional software testing, and since the concept of object detection is relatively new, principled and systematic methods for testing object detectors do not yet exist (Wang & Su, 2021). However, the object detection model's performance is already tested during the training phase, since the amount of training is based purely on the performance of the model. Therefore, the testing in this chapter concentrates on the feature requirements defined in chapter 3. This chapter covers all steps and results of the testing.

To fully measure how well the selected model can handle situations an operating AGV is most likely to encounter, the model is tested in a simulated environment. Testing is carried out by running the model on a pre-recorded video, which is essentially the same as feeding live data to the model. The video contains two different test cases for each of the four classes. These test cases are incidents that have occurred, close-call situations, or similar situations that the AGVs will most likely encounter based on earlier incidents.
In addition, the model is tested on how well it can identify a manual truck, especially between pallet racks, since a functional requirement for the application is to detect and identify a manual truck in order to leave more clearance. All test cases are visualized and explained in chapter 5.2. The model is also tested to evaluate how well it meets the quality attribute that the application should not lower the system's overall performance by causing unnecessary stops with false predictions, i.e., the number of false predictions should be as low as possible. This is carried out by running the model on a video containing footage of a regular AGV task, with the exception that all instances of objects the model is able to classify are cut out.

5.1 Test requirements

The model's performance is evaluated based on how well it meets the requirements defined in chapter 3.3, system features and requirements. The requirement is that the AGV must detect all objects specified in the scope of detectable objects before the object is closer than 1,8 meters from the AGV when moving at a speed of 1,2 m/s. The objects the model must detect and identify are a rider-truck, pallet-jack, reach-truck, and counterbalance-truck. Each scenario or test case has three evaluation criteria: "Object was detected", "Object was identified", and "Object was identified outside the range of 1,8 m". The result is either "pass" or "fail" for each evaluation criterion, and if one of the criteria fails, the overall result for that test case is a "failure". To pass a test case, all criteria have to be a "pass". The line or limit that an object cannot surpass is estimated roughly by placing an object 1,8 m from the AGV and using it as a reference. Figure 13 visualizes this situation, where Hc is the height of the camera, θ is the pitch angle of the camera, Fw the field of view, and d the distance between an object and the AGV.
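The geometry of figure 13 can also be used to estimate where the camera's view meets the floor. The thesis estimates the 1,8 m limit empirically with a reference object; purely as an illustrative sketch (the formula and the example values below are assumptions, not measurements from the thesis), the band of floor visible to a downward-pitched camera can be computed as:

```python
import math

def ground_view_range(h_c: float, pitch_deg: float, vfov_deg: float):
    """Nearest and farthest floor points visible to a camera at height h_c (m),
    pitched down by pitch_deg from the horizontal, with vertical FOV vfov_deg."""
    pitch = math.radians(pitch_deg)
    half_fov = math.radians(vfov_deg) / 2
    d_near = h_c / math.tan(pitch + half_fov)   # ray through the lower image edge
    upper = pitch - half_fov                     # ray through the upper image edge
    d_far = h_c / math.tan(upper) if upper > 0 else math.inf
    return d_near, d_far
```

With, for example, h_c = 0.5 m, a pitch of 30 degrees and a vertical FOV of 40 degrees, the visible floor band runs from roughly 0.42 m to 2.84 m in front of the camera; if the upper ray points at or above the horizon, the far limit is unbounded.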
This method applies only to situations where objects are placed on the ground. However, since the testing is intended to evaluate the overall performance of the model in terms of how well it detects objects, the deficiency of not considering objects in the air is accepted.

Figure 13 Distance estimation between an object and the AGV.

Picture 19 The horizontal black solid line represents the limit that an object cannot surpass, and the vertical dashed lines illustrate the width of the AGV. A pallet-jack is classified with a confidence of 95%.

The quality attribute for false predictions is tested by running the model on the video with confidence thresholds of 0.1, 0.3, 0.5, 0.7, and 0.9. The model is tested on 10 000 frames. The results are analysed by counting the number of detections made by the model when the premise is that there should not be any. A detection that lasts for only a couple of frames is not worthy of attention, since it would not be recognized as an obstacle by the application. Therefore, only detections that last for 10 or more frames are considered remarkable. Additionally, a gap of at most two frames is considered part of a continuous detection.

5.2 Test cases

Test cases are selected based on knowledge about what kinds of situations an operating AGV will most likely encounter, or incidents where an AGV has crashed into one of the class objects. The situations vary between the different classes, since they are all used for different purposes and hence encountered in different places and from different angles. E.g., encountering a stationary counterbalance-truck in the path of an AGV is considerably rarer than a pallet-jack or rider-truck. Pallet-jacks can be encountered basically everywhere and from all possible angles, since they are mobile equipment and hence tend to be left unsupervised in places they do not belong.
AGVs can detect pallet-jacks with their current scanners if the whole pallet-jack, including the shaft, is in the path. If only the forks of a pallet-jack are in the AGV's path, they go undetected. Therefore, the model is tested on scenarios where a pallet-jack is placed so that only the forks are in the AGV's path.

Picture 20 Test cases for pallet-jack.

Rider-trucks are also relatively mobile equipment and often left unsupervised in various places, similarly to pallet-jacks. AGVs can detect the body of a rider-truck with their current scanners but not the forks. The test cases for rider-trucks are cases where an AGV encounters a stationary rider-truck with only the forks in the AGV's path.

Picture 21 Test cases for rider-truck.

Reach-trucks are most likely encountered between pallet racks when picking or placing pallets. The probability that an AGV would encounter a stationary reach-truck unsupervised in its path is negligible, since they are parked in areas where AGVs do not operate. Additionally, a reach-truck operator seldom has to step out of the truck when operating in the areas AGVs operate in. AGVs can detect reach-trucks with their current scanners except when only the forks are in the AGV's path. However, since this scenario is not relevant considering its probability, the model is instead tested on cases where it should identify a reach-truck from a longer distance to leave more clearance if needed.

Picture 22 Test cases for reach-truck.

Counterbalance-trucks are most likely encountered in the AGV's path when used to move pallets from one place to another or to organize pallets on the floor. Counterbalance-trucks are also parked in areas where AGVs do not operate and are therefore rarely encountered stationary and unsupervised. AGVs are also able to detect counterbalance-trucks with their current scanners, but not if only the forks are in the path.
The probability that this would happen is higher than with a reach-truck, and it is therefore tested on the model. The second test case for the counterbalance-truck is a scenario where an AGV encounters an operating counterbalance-truck from the side, which should be identified early to leave more clearance.

Picture 23 Test cases for counterbalance-truck.

5.3 Test results

Results for all test cases defined in chapter 5.2 are presented in table 10. The model passed the tests with a success rate of 100%, which indicates that the model performs up to expectations. The number of test cases was relatively low but covered all relevant cases considering the requirements. The variation of the background is always very small due to the positioning of the camera, which means that the only notable difference between the scenarios is the positioning of the detectable object relative to the AGV. Having more test cases would not bring additional value to the testing, since it would fall into the category of measuring the accuracy of the model, which was already done when running the model on the test dataset.

The model was able to detect and identify all classes within 19 to 27 frames from when the object appeared in the field of view. The video was recorded at a frame rate of 30 fps, which means that when the AGV is traveling at a speed of 1,2 m/s, there is a 1-second or 30-frame window for the model to detect and classify the object when the field of view is 3 m. However, it is to be noted that the field of view is adjustable by changing the pitch angle of the camera, which gives the application more time to make the detection and send a stop signal to the AGV. For all test cases, a screenshot was taken of the moment when the object in question was initially detected and identified, and of the moment when it crossed the line for activating the brakes. All related screenshots are visible in attachment 5.
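The 1-second window quoted above follows from the distance between the start of the field of view and the braking limit (assuming, as read here, a 3 m field-of-view start and the 1,8 m limit):

```python
def detection_window(fov_start_m: float, brake_limit_m: float,
                     speed_m_s: float, fps: int) -> float:
    """Frames available between an object entering the field of view
    and reaching the braking limit."""
    seconds = (fov_start_m - brake_limit_m) / speed_m_s
    return seconds * fps

# (3.0 - 1.8) m / 1.2 m/s = 1.0 s, i.e. about 30 frames at 30 fps
frames = detection_window(3.0, 1.8, 1.2, 30)
```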
                       Test case 1                          Test case 2
Class                  Detected  Identified  >1,8m  OVR     Detected  Identified  >1,8m  OVR
Pallet-jack            Pass      Pass        Pass   Pass    Pass      Pass        Pass   Pass
Rider-truck            Pass      Pass        Pass   Pass    Pass      Pass        Pass   Pass
Reach-truck            Pass      Pass        Pass   Pass    Pass      Pass        Pass   Pass
Counterbalance-truck   Pass      Pass        Pass   Pass    Pass      Pass        Pass   Pass

Table 10 Test results by class and evaluation criteria.

Time to detect object (frames)
Class                  Test case 1  Test case 2
Pallet-jack            20           24
Rider-truck            27           19
Reach-truck            26           25
Counterbalance-truck   22           24

Table 11 Number of frames it took for the model to detect and classify the objects by test case.

The second requirement for the testing was that the application should not lower the system's overall performance by causing unnecessary stops with false predictions. The results are visible in tables 12 and 13. The model caused a total of 1347 false predictions when running with a confidence threshold of 0.1. Even with a confidence threshold of 0.7, the model detected an object in 191 frames. However, it is notable that the number and duration of false detections decrease as the threshold increases, indicating that the predictions have a low confidence. Nevertheless, the model would have caused 8 unnecessary stops when run with a confidence threshold of 0.7, assuming that the detected object is in the AGV's path within 1,8 meters from the AGV.

Threshold  Rider-truck  Reach-truck  Counterbalance-truck  Pallet-jack  Total  Accuracy
Conf. 0.1  899          323          97                    28           1347   86,53 %
Conf. 0.3  486          123          38                    2            649    93,51 %
Conf. 0.5  190          44           6                     -            240    97,60 %
Conf. 0.7  170          19           2                     -            191    98,09 %
Conf. 0.9  25           -            -                     -            25     99,75 %

Table 12 Results for testing the number of false detections in frames.

Threshold  0.3s to 1s  1s to 2s  2s to 3s  3s to 4s  Total
Conf. 0.1  21          5         1         3         30
Conf. 0.3  11          3         2         1         17
Conf. 0.5  5           2         -         -         7
Conf. 0.7  6           2         -         -         8
Conf. 0.9  1           -         -         -         1

Table 13 False detections by duration.
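The counting rule behind tables 12 and 13 (only detections lasting 10 or more frames count, and gaps of at most two frames are merged into one continuous detection) can be sketched as follows; the function and parameter names are illustrative, not taken from the thesis scripts:

```python
def detection_events(frames_with_detection, min_len=10, max_gap=2):
    """Group frame indices into detection events, merging gaps of up to
    max_gap frames, and keep only events lasting min_len frames or longer."""
    events = []
    start = prev = None
    for f in sorted(frames_with_detection):
        if start is None:
            start = prev = f
        elif f - prev <= max_gap + 1:
            prev = f  # within an allowed gap: still the same event
        else:
            events.append((start, prev))
            start = prev = f
    if start is not None:
        events.append((start, prev))
    # keep only events spanning at least min_len frames
    return [(s, e) for s, e in events if e - s + 1 >= min_len]
```

For instance, detections on frames 0-4 and 7-11 (a two-frame gap) merge into a single 12-frame event, while an isolated three-frame blip is discarded.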
Figure 14 Number of false detections in frames by confidence threshold and object class.

According to the testing results, the model demonstrates its ability to detect and classify all classes in a variety of situations quickly and accurately. It is possible to utilize the model as an obstacle detector for the current classes with a high degree of confidence. Nevertheless, the model's usability is reduced by a significant number of unnecessary stops caused by false positive detections. Thus, additional training and a more in-depth analysis of the training dataset to address incorrect labelling would be required to enhance the model's performance.

6 Results and observations

This chapter interprets and explains the relevance of the findings in the context of what is previously known about the research problem studied in this master's thesis, and conveys the new knowledge and ideas about the issue that emerged after taking the findings into account. The objective of this master's thesis was to create a proof of concept for a computer vision application intended to improve the current AGV system's performance, and the research question this master's thesis attempts to answer is as follows: "Can object detection be utilized to improve the current automated guided vehicle system's performance in terms of avoiding crashes into obstacles it cannot detect with its current scanners?". To answer this question, a short literature review of computer vision and object detection was laid out to understand how the technology in question works on a theoretical level. After a general understanding of the concept was acquired, a plan for the computer vision application was laid out in the form of a requirement specification in order to properly define the requirements the object detection model has to meet.
This was then followed by the actual training of an object detection model for the purpose of this master's thesis. Based on the literature, it is safe to state that computer vision is a broad concept, has been widely utilized to solve various problems in different business areas, and has a lot of potential considering the research problem discussed in this master's thesis. The solution this master's thesis proposes for solving the research problem focuses solely on object detection, in terms of training a model for detecting and identifying a limited number of specified obstacles. The solution is not fully dynamic considering the variety of obstacles an AGV could possibly encounter and fail to detect, but it supplements the limitations of the current system in a critical area and is easily scalable.

In this master's thesis, a YOLOv7 object detection model has been experimentally validated for accuracy and examined as a potential solution to the current incapabilities of the AGV system. Based on the results of this master's thesis, object detection could be applied as a potential solution to improve the current automated guided vehicle system's performance in terms of avoiding crashes into obstacles it cannot detect with its current scanners, when considering the performance of the required object detection model. The best performing model trained in this master's thesis measured a mean average precision of 0.9393 mAP@.5, a recall of 0.9139, and a precision of 0.9193. The model passed all test cases with a success rate of 100%, which means that it fulfils all requirements set for the application on a theoretical level.

The model was trained to detect only four different classes, and having more classes would essentially require more data, which would result in longer training times. A limiting factor during the conduct of this master's thesis was the computational resources available for training the model.
The training was carried out outside regular office hours, which limited the training time to 30 epochs for each model. However, scaling the model to detect new objects requires additional data, which is easily acquired by consistently recording the daily operations of the AGVs in the same fashion as during the data collection phase in chapter 4.2. By doing this, new problematic obstacles could be identified from the data and thereby included in the model's training dataset.

As explained, the solution this master's thesis proposes is not dynamic in the sense that it is only able to detect obstacles the model has been trained to detect. However, a classifying object detector enables profound actions based on the classified obstacles. I.e., when a specific obstacle or object is detected, the application can act accordingly, e.g., leave more clearance when a manual forklift is in the AGV's path. As the results of the testing showed, a YOLOv7 object detector can detect and classify objects fast and precisely while seeing only a small portion of the object. In the testing, the YOLOv7 object detector measured an average speed of 0.3 ms per inference. For the application to be able to make decisions based on the classification, the model must perform with high accuracy and should not confuse one specific object for another. The tenth model measured the highest overall F1 score of 0.827 across all classes. Surprisingly, the model measured an F1 score of only 0.793 for the pallet-jack, even though it is a relatively simple object. Although the model showed a low F1 score for some of the classes, it predicted objects correctly in the test video with a high confidence of 0.9 and above. As this master's thesis has pointed out, the data collection and annotation processes are both crucial for the whole development process.
The more consistent, high-quality data is available for training, the better the outcome, as can be seen when comparing the results between training runs three and five in figures 14 and 15, where the dataset was increased. This master's thesis did not focus on fixing incorrect annotations or false predictions other than manually checking the correctness of the datasets used to train the model and running the model on a test video to analyse the number of false predictions. The results showed that the model would cause unnecessary stops in its current form, falsely detecting objects at a rate of 2.4% with a confidence threshold of 0.5, if the detections were to happen in the AGV's path within 1.8 m from the AGV. The same rate was 13.47% when running with a confidence threshold of 0.1.

The model falsely classified unseen objects, such as a cleaning cart and a door, with a relatively high confidence of 0.9 and above. This implies that the model would need further training to increase its accuracy. The lower-confidence false predictions were mostly marks on the floor and random pallets in the outer periphery. This could be explained by incorrect annotations or by overfitting due to the data being rather homogeneous, since the data was collected in the form of video, where the annotated frames do not deviate much from each other. This is also visible in picture 24, which shows an example of a training batch of 16 frames used to train the tenth model. To further improve the model's performance in the future, a proper tool intended for analysing the data to eliminate possible incorrect annotations would be advantageous.

This master's thesis did not experiment with the other, lighter versions of YOLOv7, such as YOLOv7-tiny, which uses an edge-GPU-oriented architecture optimized for inference on edge devices.
The lighter version is expected to measure much faster inference speeds but slightly decreased performance in terms of accuracy (Wang;Bochkovskiy;& Mark Liao, 2022). The higher-end devices of the proposed Nvidia hardware family can run a YOLOv7 object detector, but the lower-end devices are not suitable for mobile deployments due to the high computational requirements of YOLOv7. Nvidia's Xavier AGX can deliver rates of 17 fps with YOLOv7, which is suitable for a real-time industrial object detector for the purpose of this master's thesis. Generally, the inference speed of the device should be 30 to 60 fps for an industrial real-time object detector (Nguyen;Bae;Lee;Lee;& Kwon, 2022). Thus, depending on the speed the AGV is traveling at and the pitch angle of the camera, it is enough that the device can deliver inference speeds of 12 fps and higher, which translates to a travel distance of 10 cm or less per frame. Being able to detect obstacles every 10 cm is enough for the application to be functional.

Based on the optimization process carried out during the training and the testing performed afterwards, it turned out that decreasing the hyperparameters related to data augmentation improves the model's performance when using a dataset like the one used in this master's thesis. In the tenth training run, the mix-up, scale, and paste-in data augmentation hyperparameters were lowered slightly. The effects of lowering these parameters were not clearly visible during the training process, but based on the model's performance on the test dataset, lowering them had a positive effect. A speculative conclusion can be made that the nature of the dataset used in this master's thesis delivers better model performance with lower data augmentation effects, since the background of the objects is always the same both in the training dataset and in the end use case. To confirm these findings, more comprehensive testing would be required.
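The 12 fps lower bound follows directly from the AGV's travel speed; the helper below simply reproduces the arithmetic in the paragraph above:

```python
def travel_per_frame_cm(speed_m_s: float, fps: float) -> float:
    """Distance the AGV travels between two consecutive inferences."""
    return speed_m_s / fps * 100.0

# At 1.2 m/s and 12 fps the AGV moves about 10 cm between inferences
step = travel_per_frame_cm(1.2, 12)
```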
The next step towards implementing the actual application would be to investigate the hardware requirements for the communication between the application and the AGVs, and to pilot the trained YOLOv7 object detection model on embedded hardware. This master's thesis could be used as a reference when defining how the computer vision application should interpret the detections made by the object detector in terms of functionality. The overall idea is clear, but for the application to be fully functional, details such as handling turns, where objects are closer to the AGV, would need to be managed without affecting the system's ability to detect obstacles. I.e., the application would need to identify that the AGV is turning and thereby narrow down the limit for activating the brakes if an obstacle is detected within this limit.

7 Conclusions

The complex nature of the production and warehouse environment poses a significant challenge to the current automated guided vehicle system. The system's sensitivity to unexpected changes and challenges in the environment results in a slowdown of operations, which cannot be efficiently resolved with static functions. Given the dynamic and constantly evolving nature of the environment, an adaptive solution is necessary to address the multi-dimensional problem at hand. The use of an object detection application with high speed and accuracy has the potential to provide the necessary adaptability to overcome the challenges faced by the current system. As demonstrated in this master's thesis, preliminary work is crucial for the success of any project. Adequate practical knowledge of the target process, coupled with a comprehensive understanding of the underlying technology, enables the development of a well-performing object detector capable of addressing complex problems such as the one presented in this thesis.
Computer vision enables the development of more complicated autonomous systems, but to fully leverage this technology it is essential for the developer to have a comprehensive understanding of the underlying technology. Ideally, in machine learning, the goal is to select a model at the sweet spot between underfitting and overfitting, but this is very difficult to achieve in practice. Determining the problem and defining the requirements of the application are essential but are only a part of the whole process. Collecting sufficient high-quality data for training the model is critical for achieving good model performance, assuming that annotations are performed through a well-defined process to eliminate inconsistency and incorrect labelling.

As a result of the proposed solution for the research problem in this master's thesis, the AGVs will most likely stop more often, but by doing so they avoid crashing into obscure obstacles. With the help of a classifying object detector running on an embedded system, it is possible to control the AGVs in a way that would streamline forklift traffic between narrow racking aisles and elsewhere in the warehouse and factory areas. The outcome of this master's thesis did not only prove the potential and capability of a YOLOv7 object detector but also had a positive influence on the target company's employees' prejudice toward new technology and the AGVs themselves. Since the project was carried out on-site, the employees working in the warehouse got hands-on experience of how a system like the AGV system could be enhanced. Most often, employees working on the shop floor have the best view of what the issues are, e.g., in a manufacturing process or supply chain. As studies on the resource-based view of strategy have pointed out, frontline employees' operational improvement competence and creativity directly correlate with a company's ability to achieve competitive advantages (Yang;K.C. Lee;& Cheng, 2016).
This master's thesis contributes directly to the target company's employees' operational improvement competence by giving practical knowledge about applied machine learning and computer vision, meaning that the frontline employees of the target company have improved premises for recognizing potential use cases for computer vision in their line of work.

Although the YOLOv7 object detector showed promising results as a potential solution for addressing the limitations of the current AGV system, it may be worth exploring a more dynamic approach, such as anomaly detection with computer vision, to compensate for the object detector's lack of adaptiveness. Anomaly detection involves identifying events or items that deviate from what is expected, in this case, an obstacle in the AGV's path (Cambridge Dictionary, 2023). In this approach, the AGV's movements are recorded and classified as a clear path. The generalized data would then be compared to a live video feed of the AGV's foreground, and if an obstacle appeared, it would deviate from the generalized data, resulting in a detection. This method could complement the proposed classifying object detector and serve as an additional system to improve the AGV's navigation capabilities.

The findings of this master's thesis suggest that developing and training an object detection model can be accomplished with moderate effort, even without extensive expertise in applied computer vision. This observation could have important implications for the target company, as it may encourage them to invest in similar projects that leverage cutting-edge technologies, such as computer vision and machine learning, in business areas where these technologies have not been applied before. By doing so, the company could gain a competitive advantage in their industry, increase operational efficiency, and improve overall performance.
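The anomaly-detection idea outlined above could, in its simplest form, compare incoming frames against per-pixel statistics collected while the path is known to be clear. The following is only an illustrative sketch of that principle (all threshold values are made up for the example), not a proposed implementation; a real system would additionally need illumination normalization and a far more robust background model:

```python
import numpy as np

def clear_path_model(clear_frames):
    """Per-pixel mean and standard deviation over frames recorded
    while the AGV's path is known to be clear."""
    stack = np.stack(clear_frames).astype(np.float32)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def is_anomalous(frame, mean, std, z=4.0, min_pixels=500):
    """Flag a frame if enough pixels deviate strongly from the clear-path model."""
    deviating = np.abs(frame.astype(np.float32) - mean) > z * std
    return bool(deviating.sum() >= min_pixels)
```

The appeal of this approach is exactly what the paragraph above describes: an obstacle never seen during training would still deviate from the clear-path statistics and trigger a detection.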
The object detection model developed in this master's thesis has the potential to be utilized in other perceptual object detection applications requiring accurate classification of model inferences. By following the steps presented in the implementation phase, it is possible to train a model that is able to classify objects with high accuracy, and possibly even better if further trained with data from a similar environment. The trained model could serve as a foundation for comparable projects, thereby reducing the time and costs involved in preliminary research efforts. When training a similar object detection model, it is crucial to invest in the quality of the training data by applying tools that help both prepare and validate the correctness of the annotated data. Having tools for this purpose streamlines the data preparation process and saves time otherwise spent analysing the primary causes of unwanted model behaviour. Based on the amount of training time used in this master's thesis, the configuration of the model should preferably be defined based on other related work rather than relying on experimental optimization from scratch.

The results of the training and testing conducted in this master's thesis implied that reducing the data augmentation parameters of a YOLOv7 model improved the model's performance when operating in an environment where the background does not vary a lot. On this basis, future research related to object detection should further examine whether less data augmentation improves the performance of an object detection model in similar environments. This could possibly facilitate the training of better performing object detection models with shorter training times using similar data in terms of background features.

References

Amazon. (2023). Amazon Machine Learning Developer Guide. Amazon Web Services.

Amitha, M., Amudha, P., & Sivakumari, S. (2020). Deep Learning Techniques: An Overview.
Advanced Machine Learning Technologies and Applications, 2021, Volume 1141, (ss. 599-608). Amos, B.;Xu, L.;& Kolter, J. Z. (2017). Input Convex Neural Networks. Arora, D.;Garg, M.;& Gupta, M. (2020). Diving deep in Deep Convolutional Neural Network. 2020 2nd International Conference on Advances in Computing, Communication Control and Networking. IEEE. Baheti, P. (17. 01 2023). Train Test Validation Split: How To & Best Practices [2023]. Noudettu osoitteesta v7labs: https://www.v7labs.com/blog/train-validation- test-set Bashir, D.;Mantanez, G.;Sehra , S.;Segura, P. S.;& Lauw, J. (2020). An Information- Theoretic Perspective on Overfitting and Underfitting. AI 2020: AI 2020: Advances in Artificial Intelligence (ss. 347-358). Springer, Cham. Bernico, M. (2018). Deep Learning Quick Reference : Useful Hacks for Training and Optimizing Deep Neural Networks with TensorFlow and Keras. Packt Publishing, Limited. Bottou, L. (2018). Online Learning and Stochastic Approximations. AT&T Labs–Research. Brownlee, J. (17. 01 2023). What is the Difference Between Test and Validation Datasets? Noudettu osoitteesta machinelearningmastery: https://machinelearningmastery.com/difference-test-validation-datasets/ Cambridge Dictionary. (13. March 2023). Meaning of anomaly in English. Noudettu osoitteesta Cambridge Dictionary: https://dictionary.cambridge.org/dictionary/english/anomaly Campesato, O. (2020). Artificial Intelligence, Machine Learning, and Deep Learning. Mercury Learning & Information. Chien-Yao, W.;Hong-Yuan, M. L.;& Bochkovskiy, A. (2020). YOLOv4: Optimal Speed and Accuracy of Object Detection. 92 Chokmani, K.;Khalil, B. M.;Ouarda, T. B.;& Bourdages, R. (2007). Estimation of River Ice Thickness Using Artificial Neural Networks. Quebec: CGU HS Committee on River Ice Processes and the Environment. Chuyi, L.;Lulu, L.;Hongliang, J.;Kaiheng, W.;Yifei, G.;Liang, L.;. . . Xiaolin, W. (2022). YOLOv6: A Single-Stage Object Detection Framework for Industrial. Dadhich, A. (2018). 
Practical Computer Vision : Extract Insightful Information from Images Using TensorFlow, Keras, and OpenCV. Mumbai: Packt Publishing. Davies, E. (2017). Computer Vision : Principles, Algorithms, Applications, Learning . London: Elsevier Science & Technology. Fisher, R. B.;Breckon, T. P.;Breckon, T. P.;Dawson-Howe, K.;Fitzgibbon, A.;Robertson, C.;. . . Williams, C. K. (2014). Dictionary of Computer Vision and Image Processing. Hoboken John Wiley & Sons, Incorporated 2014. Goodfellow, I.;Bengio, Y.;& Courville, A. (2018). Deep learning. The MIT Press 2018. Haji, S. H.;& Abdulazeez, A. M. (2021). COMPARISON OF OPTIMIZATION TECHNIQUES BASED ON GRADIENT DESCENT ALGORITHM: A REVIEW. Duhok: Duhok Polytechnic University. Heartex, Inc. (28. 01 2023). Noudettu osoitteesta Label Studio: https://labelstud.io/ IBM. (17. 12 2022). What is Artificial Intelligence (AI)? Noudettu osoitteesta IBM CLoud Learn Hub: https://www.ibm.com/cloud/learn/what-is-artificial-intelligence IBM. (16. 01 2023). data-labeling. Noudettu osoitteesta IBM: https://www.ibm.com/topics/data-labeling IEEE. (1984). IEEE Guide to Software Requirements Specifications. New York: The Institute of Electrical and Electronics Engineers. Jupyter . (19. 01 2023). JupyterHub. Noudettu osoitteesta Jupyter: https://jupyter.org/ Kamalov, F.;& Leung, H. (2020). Deep learning regularization in imbalanced data. 2020 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI) (ss. 1 - 5). Sharjah: IEEE. Kasper-Eulaers, M.;Hahn, N.;Berger, S.;Sebulonsen, T.;Myrland, Ø.;& Kummervold. (2021). Short Communication: Detecting Heavy Goods Vehicles in RestAreas in Winter Conditions Using YOLOv5. Algorithms 2021, 14, 114. 93 Kost, A.;Altabey, W. A.;Noori, M.;& Taher, A. (2019). Applying Neural Networks for Tire Pressure Monitoring Systems. ResearchGate. Kuleshov, V.;& Ermon, S. (2017). Deep Hybrid Models: Bridging Discriminative and Generative Approaches. LeCun, Y.;Bengio, Y.;& Geoffrey, H. 
(May 2015). Deep Learning. Nature 521, 436-44. LeCun, Y.;Bottou, L.;Bengio, Y.;& Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2278 - 2324. LeCun, Y.;Kavukcuoglu, K.;& Farabet, C. (2010). Convolutional Networks and Applications in Vision. International Symposium on Circuits and Systems (ISCAS 2010). Paris. Li, Q.;Yan, M.;& Xu, J. (2021). Optimizing Convolutional Neural Network Performance by Mitigating Underfitting and Overfitting. 2021 IEEE/ACIS 19th International Conference on Computer and Information Science (ICIS) (ss. 126 - 131). Shanghai: IEEE. Liao, Y.-H.;Kar, A.;& Fidler, S. (2021). Towards Good Practices for Efficiently Annotating Large-Scale Image. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (ss. 4348-4357). Nashville: IEEE. Lipton, Z. C.;Elkan, C.;& Naryanaswamy, B. (2014). Thresholding Classifiers to Maximize F1 Score. San Diego: University of California. Mayershofer, C.;Holm, D.-M.;Molter, B.;& Fottner, J. (2022). LOCO: Logistics Objects in Context. 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA). Miami: IEEE. McCarthy, J. (2007). WHAT IS ARTIFICIAL INTELLIGENCE? Stanford: Computer Science Department Stanford University. Nguyen, H.-V.;Bae, J.-H.;Lee, Y.-E.;Lee, H.-S.;& Kwon, K.-R. (2022). Comparison of Pre- Trained YOLO Models on Steel Surface Defects Detector Based on Transfer Learning with GPU-Based Embedded Devices. Sensors 2022. Nixon, M.;& Aguado, A. (2020). Feature Extraction and Image Processing for Computer Vision. London: Elsevier. 94 Nvidia. (15. 11 2022). Edge Computing. Noudettu osoitteesta Nvidia: https://www.nvidia.com/en-us/autonomous-machines/embedded- systems/jetson-nano/education-projects/ Park, K.-Y.;& Hwang, S.-Y. (2014). Robust Range Estimation with a Monocular Camera for Vision-Based Forward Collision Warning System. The Scientific World Journal. Pham, V.;Nguyen, D.;& Donan, C. (2022). 
Road Damages Detection and Classification with YOLOv7. Huntsville: Computer Science Department Sam Houston State University. Pomerat, J.;Segev, A.;& Datta, R. (2019). On Neural Network Activation Functions and Optimizers in Relation to Polynomial Regression. 2019 IEEE International Conference on Big Data (ss. 6183-6185). IEE: Los Angeles. Pragati, B. (28. March 2023). Activation Functions in Neural Networks [12 Types & Use Cases]. Noudettu osoitteesta v7labs: https://www.v7labs.com/blog/neural- networks-activation-functions Ramsundar, B.;& Zadeh, R. B. (2018). TensorFlow for Deep Learning. Sebastopol: O’Reilly Media. Roh, Y.;Heo, G.;& Whang, S. E. (2021). A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 4, 1328-1347. Rumelhart, D. E.;Hinton, G. E.;& Williams, R. J. (09. 10 1986). Learning representations by back-propagating errors. Nature, 533-536. Sanida, T.;Sideris, A.;& Dasygenis, M. (2020). A Heterogeneous Implementation of the Sobel Edge Detection Filter Using OpenCL. 2020 9th International Conference on Modern Circuits and Systems Technologies (ss. 1-4). IEEE: Bremen. Sazanita Isa, I.;Rosli, M.;Yusof, U.;Maruzuki, M.;& Sulaiman, S. (2022). Optimizing the Hyperparameter Tuning of YOLOv5 for Underwater Detection. Open Access Journal. Serrador, P. (2013). The Impact of Planning on Project Success-A Literature Review. Researchgate. 95 Shah, D. (14. January 2023). data-augmentation-guide. Noudettu osoitteesta v7labs: https://www.v7labs.com/blog/data-augmentation-guide Solving. (23. 10 2022). Products. Noudettu osoitteesta Solving: https://www.solving.com/ Sormunen, T. (2023). Pallet detection in warehouse environment. Tampere: Aalto University. Srivastava, N.;Hinton, G.;Krizhevsky, A.;Sutskever, I.;& Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 1929-1958. Szeliski, R. 
(2022). Computer Vision : Algorithms and Applications. Cham Springer International Publishing AG 2022. Wang, C.-Y.;Bochkovskiy, A.;& Mark Liao, H.-Y. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object. Institute of Information Science, Academia Sinica, Taiwan. Wang, H.;& Song, Z. (2020). Improved Mosaic: Algorithms for more Complex Images. Journal of Physics Conference Series 1684. Wang, S.;& Su, Z. (2021). Metamorphic Object Insertion for Testing Object Detection Systems. Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 1053-1065. Wood, T. (10. 01 2023). Softmax Function. Noudettu osoitteesta DeepAI: https://deepai.org/machine-learning-glossary-and-terms/softmax-layer Xiang, L.;Kaipeng, D.;Guanzhong, W.;Yang, Z.;Qingqing, D.;Gao, Y.;. . . Shilei, W. (2020). PP-YOLO: An Effective and Efficient Implementation of Object Detector. Ximea. (19. 11 2022). Jetson_Nano_Benchmarks. Noudettu osoitteesta Ximea: https://www.ximea.com/support/wiki/apis/Jetson_Nano_Benchmarks Yang, Y.;K.C. Lee, P.;& Cheng, T. (2016). Continuous improvement competence, employee creativity, and new service development performance: A frontline employee perspective. The International Journal of Production Economics, 275-288. Yingjie, T.;Duo, S.;Stanislao, L.;& Xiaohui, L. (2022). Recent advances on loss functions in deep learning for computer vision. Neurocomputing, 129-158. 96 Yorioka, D.;Kang, H.;& Iwamura, K. (2020). Data Augmentation For Deep Learning Using Generative Adversarial Networks. 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE) (ss. 516-518). Kobe: IEEE. Zhang , X.;Wang, Y.;Zhang , N.;Xu , D.;& Chen, B. (2019). Research on Scene Classification Method of High-Resolution Remote Sensing Images Based on RFPNet. Beijing : University of Chinese Academy of Sciences. Zhao, S.;Louidor, E.;& Gupta, M. (2022). Global Optimization Networks. 
Proceedings of the 39th International Conference on Machine Learning, (ss. 26927-26957). Zhaohui, Z.;Ping, W.;Wei, L.;Jinze, L.;Rongguang, Y.;& Dongwei, R. (2020). Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. 12993-13000, (ss. 12993-13000). Zhong-Qiu, Z.;Peng, Z.;Shou-Tao, X.;& Xindong, W. (209). Object Detection With Deep Learning: A Review. IEEE Transactions on Neural Networks and Learning Systems (ss. 3212 - 3232). IEEE. Appendix 1. List of objects AGVs encounter when operating Object Does the current system recognize it? Is there a risk for the AGV to encounter and crash into it? Ignored Trained to recognize Additional information AGV Yes No Yes No AGV for a machine assembly line Yes No Yes No Counterbalance forklift No Yes No Yes Dock door Yes No Yes No Hanging large objects above 2m No Yes Yes No Hanging large objects below 2m Yes No Yes No Order Picker No Yes No Yes Other transportation devices Yes No Yes No Pallet Yes Yes Yes No Pallet Jack No Yes No Yes Pallet truck No Yes No Yes Person Yes No Yes No Reach Truck No Yes No Yes Does not damage the AGV Smaller wooden parts Yes/No Yes Yes No Does not damage the AGV Traffic cone Yes No Yes No Trash Yes/No Yes Yes No Trash bin Yes No Yes No Wooden box Yes No Yes No Not visible for the camera Appendix 2. 
Python commands

!python script.py --input LatestDataset.json --names labelNames.txt --output labels
!python script2.py --input Reach-truck-val.mp4 --output images --frame-rate 30
!git clone https://github.com/WongKinYiu/yolov7
!pip install -r requirements.txt
!pip3 install opencv-python-headless==4.5.3.56
!wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7_training.pt

import os
from sklearn.model_selection import train_test_split

path1 = "./images-train/"
path2 = "./images-validation/"
path3 = "./images-test/"
image_names1 = os.listdir(path1)
image_names2 = os.listdir(path2)
image_names3 = os.listdir(path3)
train_set = os.listdir(path1)
val_set = os.listdir(path2)
test_set = os.listdir(path3)

from time import time

def print_time(t: float):
    print(f'Took {time()-t:.2f} s')

def train_val_test_split(image_names: list[str], sample_rate: int, val_perc: int,
                         test_perc: int) -> tuple[list[str], list[str], list[str]]:
    print(f'Splitting to train, val, test with sample rate of {sample_rate}')
    t = time()
    train_names, val_names = train_test_split(image_names[::sample_rate],
                                              test_size=(val_perc + test_perc) / 100,
                                              shuffle=False)
    val_names, test_names = train_test_split(val_names,
                                             test_size=test_perc / (val_perc + test_perc),
                                             shuffle=False)
    print_time(t)
    return train_names, val_names, test_names

def save_image_names1(names: list[str], filename: str, folder_path) -> None:
    file_path = f'{folder_path}/{filename}'
    print(f'Saving image names to {file_path}')
    t = time()
    names_str = ''
    for name in names:
        names_str += IMAGES_PATH + name + '\n'
    os.makedirs(folder_path, exist_ok=True)
    with open(file_path, 'w') as f:
        f.write(names_str)
    print_time(t)

IMAGES_PATH = './images/'
SPLITS_PATH = './splits/'
SAMPLE_RATE = 1
image_names_folder_path = f'{SPLITS_PATH}/sample_{SAMPLE_RATE}'
save_image_names1(train_set, 'train.txt', image_names_folder_path)
save_image_names1(val_set, 'val.txt', image_names_folder_path)
save_image_names1(test_set, 'test.txt', image_names_folder_path)

!python train.py --batch 16 --epochs 30 --data data/data.yaml --weights './runs/train/exp107/weights/last.pt' --device 0
!python test.py --weights ./runs/train/exp37/weights/last.pt --task test --data data/data.yaml --batch 16 --device 0
!python detect.py --weights ./runs/train/exp108/weights/last.pt --conf 0.9 --source FalsePredTest.mp4

Appendix 3. Hyperparameters of the tenth model

lr0: 0.001  # initial learning rate (SGD=1E-2, Adam=1E-3)
lrf: 0.01  # final OneCycleLR learning rate (lr0 * lrf)
momentum: 0.937  # SGD momentum/Adam beta1
weight_decay: 0.0005  # optimizer weight decay 5e-4
warmup_epochs: 3.0  # warmup epochs (fractions ok)
warmup_momentum: 0.8  # warmup initial momentum
warmup_bias_lr: 0.1  # warmup initial bias lr
box: 0.05  # box loss gain
cls: 0.3  # cls loss gain
cls_pw: 1.0  # cls BCELoss positive_weight
obj: 0.7  # obj loss gain (scale with pixels)
obj_pw: 1.0  # obj BCELoss positive_weight
iou_t: 0.20  # IoU training threshold
anchor_t: 4.0  # anchor-multiple threshold
# anchors: 3  # anchors per output layer (0 to ignore)
fl_gamma: 0.0  # focal loss gamma (efficientDet default gamma=1.5)
hsv_h: 0.015  # image HSV-Hue augmentation (fraction)
hsv_s: 0.7  # image HSV-Saturation augmentation (fraction)
hsv_v: 0.4  # image HSV-Value augmentation (fraction)
degrees: 0.0  # image rotation (+/- deg)
translate: 0.2  # image translation (+/- fraction)
scale: 0.9  # image scale (+/- gain)
shear: 0.0  # image shear (+/- deg)
perspective: 0.0  # image perspective (+/- fraction), range 0-0.001
flipud: 0.0  # image flip up-down (probability)
fliplr: 0.5  # image flip left-right (probability)
mosaic: 0.5  # image mosaic (probability)
mixup: 0.1  # image mixup (probability)
copy_paste: 0.0  # image copy paste (probability)
paste_in: 0.05  # image copy paste (probability), use 0 for faster training
loss_ota: 1  # use ComputeLossOTA, use 0 for faster training

Appendix 4. Training results

Figure 15 Results of the third training run.
Figure 16 Results of the fifth training run with a larger dataset.
Figure 17 F1 score of the fifth training run.
Figure 18 Results of the seventh training run with lower mosaic hyperparameter.
Figure 19 Confusion matrix after the seventh training run.
Figure 20 Results of the ninth training run with lower learning rate and initial mosaic parameter.
Figure 21 Confusion matrix of the ninth training run.
Figure 22 Results of the tenth training run with adjusted augmentation parameters.
Figure 23 Confusion matrix of the tenth training run.
Picture 24 Example of a training batch used to train model number ten.

Appendix 5. Test cases

Picture 25 Test case 1 for pallet-jack.
Picture 26 Test case 2 for pallet-jack.
Picture 27 Test case 1 for rider-truck.
Picture 28 Test case 2 for rider-truck.
Picture 29 Test case 1 for reach-truck.
Picture 30 Test case 2 for reach-truck.
Picture 31 Test case 1 for counterweight-truck.
Picture 32 Test case 2 for counterweight-truck.

Appendix 6. Dependencies

# Base ----------------------------------------
matplotlib>=3.2.2
numpy>=1.18.5,<1.24.0
opencv-python>=4.1.1
Pillow>=7.1.2
PyYAML>=5.3.1
requests>=2.23.0
scipy>=1.4.1
torch==1.7.1,!=1.12.0
torchvision>=0.8.1,!=0.13.0
tqdm>=4.41.0
protobuf<4.21.3
# Logging -------------------------------------
tensorboard>=2.4.1
# wandb
# Plotting ------------------------------------
pandas>=1.1.4
seaborn>=0.11.0
# Export --------------------------------------
# coremltools>=4.1  # CoreML export
# onnx>=1.9.0  # ONNX export
# onnx-simplifier>=0.3.6  # ONNX simplifier
# scikit-learn==0.19.2  # CoreML quantization
# tensorflow>=2.4.1  # TFLite export
# tensorflowjs>=3.9.0  # TF.js export
# openvino-dev  # OpenVINO export
# Extras --------------------------------------
ipython  # interactive notebook
psutil  # system utilization
thop  # FLOPs computation
# albumentations>=1.0.3
# pycocotools>=2.0  # COCO mAP
# roboflow
# opencv-python-headless==4.5.3.56
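The conclusions above recommend investing in tooling that validates the correctness of annotated data before training. As a minimal, hypothetical sketch of such a check for YOLO-format label files (the file names, directory layout, and class count below are made up for illustration):

```python
from pathlib import Path
import tempfile

def validate_yolo_labels(labels_dir: str, num_classes: int) -> list[str]:
    """List problems found in YOLO-format label files: wrong field count,
    class id out of range, or coordinates outside the normalized 0..1 range."""
    problems = []
    for label_file in sorted(Path(labels_dir).glob("*.txt")):
        for line_no, line in enumerate(label_file.read_text().splitlines(), 1):
            fields = line.split()
            if len(fields) != 5:
                problems.append(f"{label_file.name}:{line_no}: expected 5 fields")
                continue
            cls, *coords = fields
            if not cls.isdigit() or int(cls) >= num_classes:
                problems.append(f"{label_file.name}:{line_no}: bad class id {cls}")
            try:
                out_of_range = any(not 0.0 <= float(c) <= 1.0 for c in coords)
            except ValueError:
                out_of_range = True
            if out_of_range:
                problems.append(f"{label_file.name}:{line_no}: coordinate out of range")
    return problems

# Demo on two throwaway label files; the class count of 4 is arbitrary here.
with tempfile.TemporaryDirectory() as labels_dir:
    Path(labels_dir, "good.txt").write_text("0 0.5 0.5 0.2 0.3\n")
    Path(labels_dir, "bad.txt").write_text("7 0.5 0.5 0.2 1.3\n")
    for problem in validate_yolo_labels(labels_dir, num_classes=4):
        print(problem)  # reports the bad class id and the out-of-range box
```

A check like this runs in seconds over an exported dataset and surfaces broken annotations before they appear as unexplained model behaviour during training.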