Mikael Metsälä 

Possibilities of convolutions in AI-reconstructed 
music 

 
Vaasa 2024 

School of Technology and  
Innovations  

Master’s thesis in Computing  
Sciences 



UNIVERSITY OF VAASA 
School of Technology and Innovations  
Author:    Mikael Metsälä 
Title of the Thesis:  Possibilities of convolutions in AI-reconstructed music 
Degree:    Master of Science in Technology 
Programme:   Automation and Computer Science 
Supervisor:   Teemu Mäenpää 
Year:    2024 Page count: 71 

ABSTRACT: 
 
This thesis investigates the application of convolutional layers within an autoencoder to 
reconstruct one-dimensional audio data in systems with limited computational resources. The 
primary objective of this study is to explore whether convolutional layers could improve 
autoencoder performance by retaining key audio characteristics during the reconstruction 
process. While deep generative models have shown promise for audio synthesis, research has 
predominantly focused on large-scale implementations, leaving open questions about the 
adaptability of these approaches to smaller systems. This study hypothesized that convolutional 
layers would enable improved reconstructions compared to fully connected (FC) layers within a 
limited VRAM environment. 
 
To test this hypothesis, a controlled experimental approach was employed, which involved a 
detailed comparison of the performance of both fully connected and convolutional 
architectures. Each model was trained from scratch on one-dimensional audio sequences until 
reaching convergence. This approach allowed for a clear and precise evaluation of the relative 
effectiveness of each model type. Model performance was evaluated with mean squared error as one
of the primary metrics, alongside observations of convergence rate and memory efficiency.
 
The findings indicate that the convolutional autoencoder achieved superior reconstruction quality,
as evidenced by its lower mean squared error and faster epoch-wise progression to accuracy,
despite taking slightly longer per epoch than the FC model. These results highlight convolutional
architectures' potential to facilitate high-quality audio reconstruction on smaller systems, making
advanced AI-driven audio analysis more accessible. The convolutional model’s ability to represent
low-frequency components more effectively and with less added noise than the FC model
supports the hypothesis, although challenges, such as limitations in replicating high-frequency
components, were noted in both models. Overall, these results suggest that convolutional
autoencoders could offer a promising approach for efficiently reconstructing audio data on
constrained hardware.
 
The study contributes valuable insights to music analysis and AI audio research, particularly in 
the context of scalable model design for low-resource environments. It acknowledges 
limitations, such as subjective sound quality assessment and hardware constraints, and 
recommends future work. Further research might focus on enhancing frequency representation 
within convolutional networks and improving audio separation capabilities. By advancing 
methods that operate effectively on smaller systems, this study encourages further exploration 
of accessible AI applications in music analysis and digital audio processing. 
 

KEYWORDS: audio processing, artificial intelligence, machine learning, neural networks 



UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Mikael Metsälä
Title of the Thesis: Possibilities of convolutions in AI-reconstructed music
Degree: Master of Science in Technology
Programme: Automation and Computer Science
Supervisor: Teemu Mäenpää
Year: 2024 Page count: 71

SUMMARY:
 
This thesis examines the application of an autoencoder that uses convolutional layers and runs on
limited computational resources to the reconstruction of audio signals. The aim is to determine
whether convolutional layers can improve the autoencoder's learning ability and help it retain
features characteristic of music during the reconstruction process. Earlier studies have
demonstrated the capability of deep generative models in audio synthesis when vast amounts of
computing power and memory were available, which leaves the question open for systems with
less computing power. The hypothesis of this study is that convolutional layers can provide better
reconstruction than fully connected layers in systems with a limited amount of memory.
 
To test the hypothesis, a comparative experiment was carried out in which the performance of a
neural network built from fully connected layers was compared with that of a convolution-based
network. Both models were trained on audio data until they reached convergence. This gave a
clear and precise comparison of the effectiveness of the architectures. Model performance was
assessed primarily by mean squared error, and the convergence rate and the amount of memory
used were also examined.
 
The results show that the autoencoder with convolutional layers reconstructs the audio signal
better, as shown by its lower mean squared error and the smaller amount of noise it produces.
Based on these results, the convolutional architecture shows potential for high-quality audio
signal reconstruction on systems with limited computing power, which improves the accessibility
of such AI-based systems. Both models showed difficulties in reconstructing high frequencies.
 
In conclusion, convolutional layers improve the autoencoder's ability to reconstruct an audio
signal, particularly at low frequencies and by reducing noise, which makes it possible to use the
model on systems with limited computing power as well. This demonstrates the potential of
convolution-based architectures for high-quality reconstruction of audio data and makes AI
applications in music analysis and audio processing available to a wider audience. Future
research could focus on the ability of convolutional models to separate different frequency
components more precisely and on improving their performance in handling high frequencies.
 
 
KEYWORDS: audio processing, artificial intelligence, machine learning, neural networks



Contents

1 Introduction
2 Basic concepts and technologies
2.1 Categorisation
2.1.1 Parameter-based
2.1.2 Non-parameter-based
2.2 Deep Neural Networks
2.2.1 Model
2.2.2 Dataset
2.2.3 Objective Function
2.2.4 Optimisation Procedure
2.3 Neural Network Architectures
2.3.1 Autoencoders
2.3.2 Variational Autoencoders
2.3.3 Generative Adversarial Networks
2.3.4 Transformers
3 Related Work
3.1 WaveNet
3.1.1 Causal and Dilated Convolutions
3.1.2 Non-linear Quantisation
3.2 Jukebox
3.2.1 Multi-scale VQ-VAE
3.2.2 Scalable Transformers and Upsampling
3.3 Summary
4 Methodology
5 Experiment Design and Model Evolution
5.1 Dataset and System
5.2 Model Evolution
5.2.1 Preprocessing
5.2.2 Model development
5.2.3 Testing setup
5.3 Results
5.4 Analysis
5.4.1 Common features
5.4.2 Separating features
6 Conclusion
References

  

Pictures

Picture 1. Screenshot of Task Manager during convolutional model training.
Picture 2. Screenshot of Task Manager during base model training.

Figures

Figure 1. A simple feedforward neural network by Goodfellow et al. (2016, p. 170).
Figure 3. Variational Autoencoder depicted as a figure. Blue boxes represent the encoder, white is the latent space, and green boxes represent the decoder. 𝒙 is input, and 𝒙̂ represents output.
Figure 2. Generative Adversarial Network architecture depicted by Bengesi et al. (2023).
Figure 4. Scaled Dot-Product Attention architecture (Vaswani et al. 2017).
Figure 5. Transformer architecture (Vaswani et al. 2017). The figure consists of an encoder on the left and a decoder on the right.
Figure 6. Different types of convolutions. Blue dots depict the input layer, grey marks the hidden layers, and yellow is the output layer.
Figure 7. Effects of μ-law companding on a waveform.
Figure 8. The encoder part of the Convolutional Autoencoder is in blue, and the first latent vector is in white.
Figure 9. The decoder part of the Convolutional Autoencoder is in green, and the second latent vector is in white.
Figure 10. Evolution of the validation loss during the test with global minima. The blue curve represents the base model’s performance, and the red curve represents the convolution model’s performance.
Figure 11. Waveform of the validation audio.
Figure 12. The base model’s reconstruction data is depicted as a waveform after Epoch 0.
Figure 13. The base model’s reconstruction data is depicted as a waveform after Epoch 56.
Figure 14. The convolution model’s reconstruction data is depicted as a waveform after Epoch 0.
Figure 15. The convolution model’s reconstruction data is depicted as a waveform after Epoch 75.
Figure 16. This represents the frequency distribution and corresponding magnitudes of the validation data.
Figure 17. Frequency distribution and magnitude of the base model's reconstruction data after Epoch 0.
Figure 18. Frequency distribution and magnitude of the base model's reconstruction data after Epoch 56.
Figure 19. Frequency distribution and magnitude of the convolution model's reconstruction data after Epoch 0.
Figure 20. Frequency distribution and magnitude of the convolution model's reconstruction data after Epoch 75.

Tables

Table 1. Summary of changes in the base model's dependent variables.
Table 2. Summary of changes in the convolution model's dependent variables.

 
Abbreviations 
 
ADAM Adaptive Moment Estimation 
AI  Artificial Intelligence 
BERT  Bidirectional Encoder Representations from Transformers
CNN  Convolutional Neural Network 
CV  Computer Vision 
EDM  Electronic Dance Music 
FC  Fully Connected
GAN  Generative Adversarial Network 
GPT  Generative Pre-trained Transformer 
GPU  Graphics Processing Unit 
KL  Kullback-Leibler
MAE  Mean Absolute Error
MLP  Multilayer Perceptron 



MSE  Mean Squared Error 
NLP  Natural Language Processing 
RAM  Random Access Memory 
ReLU  Rectified Linear Unit 
RFFT  Real Fast Fourier Transform 
RMSE  Root-Mean-Square Error 
RNN  Recurrent Neural Network
SGD  Stochastic Gradient Descent 
SNR  Signal-to-Noise Ratio 
STFT  Short-Time Fourier Transform 
VAE  Variational Autoencoder 
ViT  Vision Transformer 
VQ-VAE Vector Quantized Variational Autoencoder 
VRAM  Video Random Access Memory 



1 Introduction 

Music generated with the help of Artificial Intelligence is a topic that has puzzled researchers for decades. AI-generated music research took its first steps as early as the 1950s when Hiller Jr. and Isaacson (1957) released a model that generated sheet music based on the Markov chain model. Many of these early parameter-based generative models were not multi-layered neural networks (Zhu et al., 2023), which at the time suffered from a lack of efficient training methods (Briot et al., 2019, p. 41). In the 2006 paper A Fast Learning Algorithm for Deep Belief Nets, Hinton et al. introduced a solution to this, paving the way for the rise of deep neural networks. In 2012, AlexNet, a deep neural network, won the ImageNet image recognition competition, resulting in a paradigm shift that made deep learning the state-of-the-art solution for prediction problems (Briot et al., 2019, pp. 3, 41–42).

 
Deep learning is a vague term, as it has no scientifically agreed-upon definition (Briot et al., 2019, p. 3). As a part of artificial intelligence, it usually refers to machine learning done with deep neural networks consisting of multiple layers that hierarchically extract and abstract data (Briot et al., 2019, p. 3). Briot et al. (2019, p. 3) highlight three major milestones that have fuelled the surge of deep learning: an increase in the quantity of data available, enhanced availability of computational resources, and technological advances, notably the meaningful application of convolutions, which are particularly relevant to the context of this thesis. This is supported by Bengesi et al. (2023), who identified that prior to 2010, interest in deep learning was hindered by the limited availability of computing resources and a lack of sufficiently large datasets.

 
After the major roadblocks were overcome and deep learning research gained momentum, the field took a new course in 2013 when Kingma and Welling introduced the Variational Autoencoder (VAE), followed by Goodfellow et al. (2014) with their Generative Adversarial Network (GAN), building the foundation for Generative Artificial Intelligence. At the time of their introduction, these new types of neural networks aimed to capture the underlying probability distribution of the data (Goodfellow et al., 2016, pp. 693, 697). The benefit of learning the distribution is the possibility of sampling it and generating novel data instances that resemble the original data (Goodfellow et al., 2016, p. 707).

 
These networks mainly operated on relatively low-resolution images, where the input for the network was a complete image, such as those in the MNIST database of handwritten digits (Goodfellow et al., 2014). One image from MNIST has a resolution of 28x28 (LeCun et al., 1998), which amounts to 784 input values. By comparison, a 2-minute song sampled at the usual 44.1 kHz has roughly 5.3 million input values for a single channel. As Dhariwal et al. (2020) note, modelling audio at this scale is very computationally demanding. This has led to new solutions in which the input is split into segments that are fed to the network as separate instances.
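The size comparison above can be checked with a few lines of arithmetic. The sketch below is only illustrative and assumes a single (mono) channel; the variable names are chosen here and do not come from the thesis.

```python
# Rough input-size comparison between a short song and one MNIST image.
# Assumption: one audio channel (mono) and exactly two minutes of audio.
SAMPLE_RATE_HZ = 44_100
DURATION_S = 2 * 60

audio_samples = SAMPLE_RATE_HZ * DURATION_S  # one input value per audio sample
mnist_pixels = 28 * 28                       # one input value per pixel

print(audio_samples)                  # 5292000
print(mnist_pixels)                   # 784
print(audio_samples // mnist_pixels)  # 6750
```

Under these assumptions, the audio input is several thousand times larger than a single MNIST image, which is what motivates splitting the signal into segments.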

 
This “segmented learning” can still work for non-generative tasks. However, randomly sampling the latent space for multiple audio segments and combining them is unlikely to create a coherent song. This problem has given rise to autoregressive networks, which model the probability of each new sample conditioned on all previous samples, so that the joint probability of the waveform factorises into a product of these conditional probabilities (van den Oord et al., 2016). Systems designed to generate long and coherent audio usually consist of two separate neural networks, one autoencoding and one autoregressive in nature (Dhariwal et al., 2020). The first layers of the autoencoder are often convolutional layers designed to maximise the receptive field of the network, making it easier to model longer temporal dependencies (van den Oord et al., 2016).
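As an illustration of the autoregressive idea, the joint probability of a waveform can be written with the standard chain-rule factorisation used by van den Oord et al. (2016). The symbols are chosen here for illustration: x_t denotes the t-th audio sample and T the number of samples.

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```

Each factor depends on every earlier sample, which is why generation proceeds one sample at a time.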

 
The main aim of this thesis is to explore the use of convolutions for one-dimensional sequential audio, focusing on the regenerative capabilities of autoencoders in music. The inspiration for this study arises from the challenges identified by Dieleman et al. (2018), particularly the lack of long-term structure in AI-generated music. Previous research has demonstrated the feasibility of capturing local structures like timbre, but modelling higher-level structures, such as verses and choruses, remains elusive (Dieleman et al., 2018). Moreover, generative AI systems, especially in music, typically require extensive computational resources to achieve coherent and high-quality output (Dhariwal et al., 2020). This study aims to explore approaches that can be implemented on smaller systems, offering feasible solutions for researchers without access to vast computational resources. This study seeks to answer the question: Can a convolutional architecture enhance the regenerative performance of an autoencoder on one-dimensional audio data? This study hypothesizes that applying convolutions will improve the autoencoder's ability to capture and preserve essential structural features in the reconstructed output, thereby enhancing its understanding of relationships between preceding and succeeding data points.

 
This study follows a controlled experimental design to explore the effects of different neural network architectures on AI-regenerated music. A controlled experiment allows for precise manipulation of independent variables, such as the type of neural network used, and careful observation of their effects on the dependent variables, including the quality and characteristics of the music. By creating a structured environment, this study aims to isolate specific factors and assess their impact on model performance. The methodology employed in this thesis provides a systematic approach to validate the hypothesis and contribute to the broader understanding of neural network architectures in regenerative music through empirical evidence.

 
This study is divided into six chapters. The second chapter consists of an overview of 

neural networks, aiming to give readers an understanding of the basic concepts and 

technologies surrounding the field. The third chapter examines two influential studies 

that inspired this research. The fourth chapter discusses the methodology behind this 

study. Chapter five depicts the model and describes the experiment. The last chapter 

concludes this process and discusses possible future directions for this line of study. 

 

2 Basic concepts and technologies 

Artificial intelligence is a vast field that is constantly expanding. One of its subsets is Generative Artificial Intelligence, which has recently become more popular in the eyes of the public (Bengesi et al., 2023). It can be hard to get a grasp of the field as it is moving so fast, but using categories can help make it easier to understand. There are various ways to categorise the process of generating music using AI models (Zhu et al., 2023; Bengesi et al., 2023). In addition to categorisation, this chapter explains the building blocks of a generic neural network and provides an overview of popular Generative Artificial Intelligence architectures, their functionalities, and how they operate.

 
2.1 Categorisation 

In their 2023 survey, Zhu et al. introduced an approach that divides models into two categories: parameter-based and non-parameter-based. A characteristic of this approach is that the models are differentiated by the type of input they use (Zhu et al., 2023). The non-parameter-based category is further divided into two subcategories: prompt-based and visual-based models.

 
Generative models can also be divided by model architecture, as shown by Bengesi et al. 

(2023). Architectures that have gained popularity are Generative Adversarial Networks, 

Variational Autoencoders, and Transformers (Bengesi et al., 2023). These architectures 

are described in-depth later in this chapter. 

 
2.1.1 Parameter-based 

Parameter-based models represent the majority of the models listed by Zhu et al. (2023). These models range from Hiller Jr.’s and Isaacson’s (1957) Markov chain models to Dhariwal et al.'s (2020) multi-scale Vector Quantized Variational Autoencoder types of deep neural networks. Common to these models is that they require specific input parameters such as tempo or key (Zhu et al., 2023).



 
2.1.2 Non-parameter-based  

A good example of prompt-based models is MusicLM, developed by Agostinelli et al. (2023). The model takes a text prompt as input and uses sequence-to-sequence modelling to generate songs several minutes long that adhere to the text prompt (Agostinelli et al., 2023). Applications of visual-based models like V-MusProd by Zhuo et al. (2022) include generating background music for videos by conditioning the model with images or video.

 
2.2 Deep Neural Networks 

As Bengio et al. (2012) neatly put it, AI's goal is to “understand the world around us”. The pursuit of this goal has led researchers to turn to deep learning, which, as previously established, involves the use of deep neural networks for machine learning (Briot et al., 2019, p. 3). Goodfellow et al. (2016, p. 151) describe the fundamental independent elements necessary for constructing such a machine learning algorithm as a model, a dataset, an objective function, and an optimization procedure. Next, this study expands on these four basic elements and their implications.

 
2.2.1 Model 

In the context of machine learning, the model, often referred to as a neural network, is an artificial construction that mimics the neurons of the human brain (Nwadiugwu, 2021). Each neuron in a neural network has attributes called a weight and a bias; these two, together with the input and an activation function, for example the sigmoid, determine the strength of the signal that is passed to the neuron in the next layer (Goodfellow et al., 2016, pp. 107, 65-66). This relationship is defined in Equation 1, and in Figure 1, each arrow represents a weight. In the model, information flows in two passes, a forward pass and a backward pass, also referred to as forward propagation and back-propagation (Goodfellow et al., 2016, p. 200).

 

$$\hat{y} = \sigma(wx + b) \tag{1}$$
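A minimal sketch of Equation 1 for a single neuron with several inputs is given below. It assumes a sigmoid activation, as mentioned in the text; the function names and the example values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid activation: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b):
    """Equation 1: weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Illustrative values: the weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
y_hat = neuron_output(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(y_hat)  # 0.5
```

In a full network, the output of each neuron becomes one of the inputs of the neurons in the next layer.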

 
A feedforward network, or Multilayer Perceptron (MLP), is a basic neural network architecture composed of layers of neurons where, in the forward pass, the signals flow unidirectionally from the input layer to the output layer (Goodfellow et al., 2016, p. 164). Figure 1 illustrates a typical feedforward network. Designed to approximate a function by mapping inputs to outputs, it has a clearly defined structure with multiple layers: an input layer 𝑥, one hidden layer ℎ, and an output layer 𝑦 (Goodfellow et al., 2016, pp. 164-165). While feedforward networks are foundational, other architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) cater to specific data types and tasks, leveraging unique structural features to process spatial and sequential data, respectively (Goodfellow et al., 2016, pp. 326, 367).

 
Figure 1. A simple feedforward neural network by Goodfellow et al. (2016, p. 170) 

 
CNNs are a specialised type of feedforward network where one or more layers are replaced with convolutional layers (Goodfellow et al., 2016, p. 326). The convolutional operation in these layers involves sliding a filter or kernel 𝑤 over the input data 𝑥 to produce a feature map 𝑠 (Goodfellow et al., 2016, p. 328). This process captures local patterns by applying the same filter across various parts of the input, thus enabling CNNs to process spatial or multidimensional data like images efficiently (Goodfellow et al., 2016, pp. 330-333). To better understand the convolutional operation, consider a simplified one-dimensional example: a signal 𝑥 is processed by a shorter filter 𝑤, which is a set of weights. During convolution, the filter 𝑤 is slid along the signal 𝑥. At each position, the filter is multiplied element-wise with the segment of the signal it covers, and the products are summed to give one element of the feature map.

 
$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\,w(t - a) \tag{2}$$

 
Mathematically, the convolution operation is depicted by Goodfellow et al. (2016, p. 327) as Equation 2, where the asterisk (*) denotes convolution. In this equation, 𝑡 represents the index of the element in the feature map that is being calculated. Importantly, the filter 𝑤 is shorter than the signal 𝑥, which leads to an interesting property in the equation: an infinite sum calculated in a finite space (Goodfellow et al., 2016, p. 328). This is because outside the bounds of the filter, 𝑤 is zero, so any product involving 𝑤 outside its bounds is also zero. Another notable property involves 𝑡 − 𝑎, which indicates that the filter is reversed, a process often called flipping (Goodfellow et al., 2016, p. 328). However, Goodfellow et al. (2016, p. 329) note that in machine learning implementations, this flipping is often not performed, and the operation should more accurately be called cross-correlation, though it is still commonly referred to as convolution.
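The difference between convolution with flipping and cross-correlation can be seen in a small one-dimensional sketch. The code below is an illustration rather than the implementation used in this thesis: it only evaluates positions where the filter fully overlaps the signal, and the function names and example values are hypothetical.

```python
def convolve_valid(x, w):
    """Equation 2 restricted to positions where the filter fully overlaps the signal.

    The filter is flipped before it is slid along the signal.
    """
    k = len(w)
    flipped = w[::-1]
    return [sum(x[t + j] * flipped[j] for j in range(k))
            for t in range(len(x) - k + 1)]


def cross_correlate_valid(x, w):
    """Same sliding sum of products, but without flipping the filter."""
    k = len(w)
    return [sum(x[t + j] * w[j] for j in range(k))
            for t in range(len(x) - k + 1)]


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]  # asymmetric filter, so flipping changes the result

print(convolve_valid(signal, kernel))         # [2, 2, 2]
print(cross_correlate_valid(signal, kernel))  # [-2, -2, -2]
```

For a symmetric filter the two operations give identical results, and because the filter weights are learned during training, a network can learn the flipped filter just as easily, which is why the distinction rarely matters in practice.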

 
2.2.2 Dataset 

The dataset consists of the representation or features of the data and plays a crucial role in the model's performance (Bengio et al., 2012). This connection between representation and performance has inspired researchers to develop algorithms capable of representation learning, which has reduced the need for feature engineering (Bengio et al., 2012). Feature engineering refers to a process where the dataset is manually configured into a form that is more suitable for the model (Bengio et al., 2012). The reduced need for manual labour has sped up the process of utilising artificial intelligence (Bengio et al., 2012). In essence, representation learning refers to a concept in which the neural network is presented with raw data and, during training, automatically learns the meaningful features of the data (Bengio et al., 2012). There are multiple ways to make a neural network learn, as described in the book Deep Learning by Goodfellow et al. (2016, pp. 103-104).

 
Two of the most prominent ones are called supervised and unsupervised learning. Although there is no formal definition, supervised learning typically involves solving regression and classification problems, while unsupervised learning aims to understand the underlying probability distribution of the data (Goodfellow et al., 2016, p. 103). One way to understand this is that in supervised learning, the model is given a label 𝑦 in addition to the data 𝑥, and it tries to do classification by learning the probability 𝑝(𝑦|𝑥) (Goodfellow et al., 2016, p. 103). In the unsupervised learning process, the model tries to learn the underlying probability distribution 𝑝(𝑥) automatically (Goodfellow et al., 2016, p. 142). The model can then be used in, for example, anomaly detection, where deviations from the expected distribution can signal atypical events (Goodfellow et al., 2016, p. 100).

 
A third learning method derived from the above-mentioned approaches is self-supervised learning, which is usually associated with more complex deep neural networks (Ericsson et al., 2022). Ericsson et al. (2022) have divided self-supervised learning into a pretext task and a downstream task. In the pretext task, the model uses unsupervised learning to capture a meaningful data representation, for example, in a lower dimension (Ericsson et al., 2022). The downstream task then utilises this new domain for improved learning (Ericsson et al., 2022). The related works chapter discusses OpenAI’s music generation model, Jukebox, which can be thought to represent this learning method, with Vector Quantized Variational Autoencoder (VQ-VAE) training as the pretext task and Scalable Transformer training as the downstream task.

 

2.2.3 Objective Function 

An objective function is a mathematical function that guides a machine learning model to adjust its weight and bias parameters in an attempt to minimise or maximise the objective function (Goodfellow et al., 2016, p. 80). In supervised learning, the goal is often to minimise the difference between the model's outputs and the target values, and this is measured with a cost function, also known as a loss function (Goodfellow et al., 2016, p. 80; Nielsen, 2015, p. 16).

 
In regression tasks, also known as quantitative tasks, the Mean Squared Error (MSE) is widely used to quantify the average of the squares of the errors, effectively measuring how far the estimated values deviate from the actual values (James et al., 2023, p. 28). In contrast, for classification tasks or qualitative tasks, Cross-Entropy Loss is frequently employed as it quantifies the divergence between the actual labels and the predicted labels and results in faster convergence compared to Mean Squared Error (James et al., 2023, p. 28; Nielsen, 2015, p. 63). Equation 3 depicts MSE, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of values.

 
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{3}$$
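Equation 3 translates directly into a few lines of code. The sketch below is illustrative only; the function name and the sample values are assumptions and not taken from the experiment in this thesis.

```python
def mean_squared_error(y_true, y_pred):
    """Equation 3: average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Illustrative audio-like values in the range [-1, 1]
target = [0.0, 0.5, -0.5, 1.0]
reconstruction = [0.0, 0.25, -0.5, 0.5]

# Squared errors: 0, 0.0625, 0, 0.25, so the mean is 0.3125 / 4
print(mean_squared_error(target, reconstruction))  # 0.078125
```

Because the errors are squared, a single large deviation contributes more to the cost than several small ones.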

 
As the cost function is used to direct learning, a decreasing cost might be taken as a sign of learning (James et al., 2023, p. 28). However, this is not always the case, as machine learning models often suffer from overfitting, a phenomenon where the model learns the training data very well but fails to generalise effectively to new, unseen test data (Nielsen, 2015, p. 75; Goodfellow et al., 2016, p. 109). More specifically, it can occur when the model parameter count is high and the amount of training data is low (Nielsen, 2015, p. 74). Luckily, it is not necessarily a sign that the model is inherently unable to learn, as prolonged training can be the cause of overfitting (Nielsen, 2015, p. 75).

 

To prevent overfitting, regularisation techniques such as L2 regularization are introduced into the cost function (Nielsen, 2015, p. 79). Regularization tries to ensure the model does not overly adapt to the noise within the training data (Nielsen, 2015, p. 84). In L2 regularization, a regularization term is added to the cost function. In a machine learning setting, this term is often the squared L2 norm depicted in Equation 4, where 𝜆 is the regularization parameter, which balances how well the model fits the data against how small the weights are kept (Nielsen, 2015, p. 79; Goodfellow et al., 2016, pp. 117, 227).

 
$$\lambda \lVert w \rVert_2^2 = \lambda \sum_{w} w^2 \tag{4}$$
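As a rough illustration of Equation 4, the penalty can be computed from a flat list of weights and added to an unregularised cost. The function names, the data cost of 1.0, and the weight values below are hypothetical.

```python
def l2_penalty(weights, lam):
    """Equation 4: the regularization parameter times the sum of squared weights."""
    return lam * sum(w ** 2 for w in weights)


def regularised_cost(data_cost, weights, lam):
    """Unregularised cost (for example, MSE) plus the L2 penalty."""
    return data_cost + l2_penalty(weights, lam)


weights = [0.5, -1.0, 2.0]  # squared: 0.25 + 1.0 + 4.0 = 5.25

print(l2_penalty(weights, lam=0.5))              # 2.625
print(regularised_cost(1.0, weights, lam=0.5))   # 3.625
```

Setting `lam` to zero recovers the original cost, while larger values push the optimiser towards smaller weights at the expense of fitting the training data.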

 
Nielsen (2015, p. 86) states that there is no entirely convincing theoretical explanation for why regularization works. Regularization simplifies the network, and that is often offered as a general scientific principle as to why it works, but Nielsen (2015, p. 85) points out that simpler does not always equal better. Goodfellow et al. (2016, pp. 117-118) talk about the importance of domain knowledge when designing machine-learning solutions and how excessive regularization can hinder the model’s ability to learn and lead to underfitting. Underfitting is the opposite of overfitting, and it occurs when the model is unable to learn the training data (Goodfellow et al., 2016, p. 109). The use of regularization boils down to a bias-variance trade-off described by James et al. (2023, pp. 242-243), where increased regularization decreases variance but increases bias. It follows that the goal of regularisation is to restrict the model’s capacity to overfit while still keeping the bias low enough that the model does not underfit. Finding this balance is crucial when trying to achieve the best possible generalization.

 
More complex neural networks have different types of regularisation methods, one of which is Kullback-Leibler (KL) divergence, which is used as a regularisation term in VAEs (Goodfellow et al., 2016, p. 693). It penalises deviations from expected probability distributions, ensuring desirable properties such as continuity for the posterior distribution (Goodfellow et al., 2016, p. 72). These cost functions are foundational to the optimisation procedure, which is discussed in the subsequent section.

 
2.2.4 Optimisation Procedure 

An optimisation procedure is generally a very difficult task, and it is also the fourth ele-

ment of a machine learning algorithm described by Goodfellow et al. (2016, pp. 151, 

279). It refers to the process of minimising or maximising the objective function 𝑓(𝑥) by 

optimizing 𝑥 (Goodfellow et al., 2016, p. 80). One of the most prominent optimisation 

procedures is gradient-based optimisation (Goodfellow et al., 2016, p. 80). Previously, 

gradient-based optimization was described as “slow or unreliable”, but it has since been

accepted that it provides useful results in a reasonable time even though it does not 

always give the optimal solution (Goodfellow et al., 2016, p. 150). In other words, gradi-

ent descent converges to a local minimum or close to it but seldom finds the global min-

imum. Gradient descent is a process that utilises partial and directional derivatives to 

calculate the gradient ∇_𝑥𝑓(𝑥), and the objective is to determine the direction that decreases

𝑓(𝑥) the most rapidly (Goodfellow et al., 2016, pp. 82-83). The gradient points in the

direction of the steepest ascent (Goodfellow et al., 2016, p. 83). Equation 5 describes

the optimisation of 𝑥 by nudging it in the direction of the negative gradient (downhill). 

The coefficient 𝜖 is called the learning rate, and it determines the step length for the 

optimisation process (Goodfellow et al., 2016, p. 84). Usually, the learning rate is a small 

constant (Goodfellow et al., 2016, p. 84). 

 
$$x' = x - \epsilon \nabla_x f(x) \tag{5}$$
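The update rule in Equation 5 can be sketched in a few lines of Python. This is a minimal illustration rather than code from the thesis experiments: the objective f(x) = (x - 3)^2, its derivative, the learning rate of 0.1, and the function name `gradient_descent` are all assumptions chosen for the example.

```python
def gradient_descent(grad_f, x0, learning_rate=0.1, steps=100):
    """Repeatedly apply x' = x - epsilon * grad f(x) (Equation 5)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad_f(x)  # step against the gradient (downhill)
    return x


# Hypothetical objective f(x) = (x - 3)^2 with derivative 2 * (x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # close to 3.0, the minimiser of the example objective
```

With a convex example like this one the local minimum is also the global minimum; for neural network objectives that is generally not the case.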

 
Nowadays, the machine learning field is dominated by the stochastic gradient descent 

(SGD) algorithm (Goodfellow et al., 2016, p. 150). SGD is an extension of basic gradient 

descent, and the need for it becomes obvious when larger and larger training sets are introduced

to improve generalization (Goodfellow et al., 2016, p. 149). The problem with

regular gradient descent is that when the amount of data 𝑥 grows so does the computa-

tional cost (Goodfellow et al., 2016, p. 149).  This is represented with 𝑂(𝑥) which means 



that the cost is linear (Goodfellow et al., 2016, p. 149). SGD solves this problem by drawing

a small set of samples uniformly from the data, called a minibatch, and doing the gradient

calculation on that limited number of samples (Goodfellow et al., 2016, p. 149). This 

leads to the computational cost per update becoming independent of the amount of data, and it is

denoted with 𝑂(1), meaning that the computational cost is constant (Goodfellow et al.,

2016, p. 149). 
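The minibatch idea can be sketched as follows. The example fits a single slope parameter with minibatch SGD; the dataset, the batch size of 8, the learning rate, and the function name `sgd_fit_slope` are hypothetical choices made for illustration only.

```python
import random


def sgd_fit_slope(xs, ys, batch_size=8, learning_rate=0.1, steps=500, seed=0):
    """Fit y = w * x by minibatch SGD on the mean squared error."""
    rng = random.Random(seed)
    indices = range(len(xs))
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(indices, batch_size)  # uniform draw of a minibatch
        # The gradient of mean((w * x - y)^2) is estimated from the minibatch only,
        # so the cost of one update does not depend on the dataset size.
        grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= learning_rate * grad
    return w


# Hypothetical noiseless dataset generated with a true slope of 2.
xs = [i / 1000.0 for i in range(1000)]
ys = [2.0 * x for x in xs]
w = sgd_fit_slope(xs, ys)
print(w)  # approaches the true slope 2.0
```

Each update touches only `batch_size` samples, regardless of whether the dataset holds a thousand or a million entries.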

 
2.3 Neural Network Architectures 

Neural network architectures encompass a wide range of models that are based on the 

idea of deep neural networks, yet models based on these architectures differ in their 

objectives and applications. While some neural network architectures, such as Genera-

tive Adversarial Networks (GANs) and Variational Autoencoders (VAEs), are designed to 

generate new data based on learned probability distributions, other architectures, such 

as standard Autoencoders, focus on data reconstruction (Bengesi et al., 2023; Goodfel-

low et al., 2016, p. 499). This chapter will discuss key neural network architectures, in-

cluding Autoencoders, Variational Autoencoders, Generative Adversarial Networks, and 

Transformers, highlighting their purposes and underlying mechanisms. 

 
2.3.1 Autoencoders 

Autoencoders are a neural network architecture focused on unsupervised learning tasks, 

including dimensionality reduction and feature extraction (Goodfellow et al., 2016, p. 

499). The foundation for autoencoders was laid by Rumelhart, Hinton, and Williams in 

1986 with the introduction of backpropagation, enabling neural networks to learn inter-

nal representations (Rumelhart et al., 1986). The concept of autoencoders as a specific 

neural network structure was later advanced by researchers such as LeCun (1987), Bour-

lard and Kamp (1988), and Hinton and Zemel (1994) (Goodfellow et al., 2016, p. 499). 

Typically, an autoencoder comprises three main components: the encoder, a bottleneck 

or latent vector, and the decoder (Goodfellow et al., 2016, pp. 499-500). This structure is similar

to the one illustrated in Figure 2. As the data flows through the network (from left to

right in Figure 2), the encoder reduces the size of the input data to fit through the bottleneck

of the latent space, and the decoder tries to reconstruct the original input from

this representation (Goodfellow et al., 2016, p. 499).  

 
Training an autoencoder involves using backpropagation and gradient descent to mini-

mise the reconstruction error between the input and the output (Goodfellow et al., 2016, 

p. 499). This process allows the neural network to capture essential features of the data 

while discarding irrelevant information (Rumelhart et al., 1986; Goodfellow et al., 2016, 

p. 499). Goodfellow et al. (2016, p. 499) highlight that learning results can be improved

by feeding the network incomplete input data and calculating the error between the

reconstruction and the complete input data.

 
Autoencoders are widely used for various purposes, including noise reduction, where 

they learn to reconstruct clean data from noisy inputs (Vincent et al., 2008; Goodfellow 

et al., 2016, p. 499). They are also applied in anomaly detection, where deviations be-

tween the input data and its reconstruction indicate unusual patterns, making them val-

uable for identifying outliers in data (Chalapathy & Chawla, 2019). Moreover, autoen-

coders serve in dimensionality reduction, compressing high-dimensional data into a 

more manageable form and aiding in tasks such as data visualisation and feature extrac-

tion (Hinton & Salakhutdinov, 2006). These applications highlight the versatility of auto-

encoders across various unsupervised learning tasks. 

 
2.3.2 Variational Autoencoders 

The concept of Variational Autoencoding was introduced in the paper Auto-Encoding 

Variational Bayes by Kingma and Welling (2013). It was developed as a general solution

for problems with intractable posterior distributions 𝑝(𝑧|𝑥) in which parameters or la-

tent variables are continuous. The proposed solution utilises the stochastic gradient var-

iational Bayes estimator to optimise the approximate posterior distribution 𝑞(𝑧|𝑥). 

 

As described in van den Oord et al.'s 2017 paper Neural Discrete Representation Learn-

ing, a Variational Autoencoder consists of practically three parts: an encoder, a latent 

space, and a decoder (illustrated by Figure 2), where the latent space is often continuous

(van den Oord et al., 2017). These components correspond to three probability distributions:

the posterior distribution 𝑞(𝑧|𝑥), the prior distribution 𝑝(𝑧), and the likelihood 𝑝(𝑥|𝑧)

(van den Oord et al., 2017). Kingma and Welling (2013) refer to 𝑞(𝑧|𝑥) as a

probabilistic encoder and 𝑝(𝑥|𝑧) as a probabilistic decoder. Initially, a prior distribution 

𝑝(𝑧) is defined, representing the latent space’s expected shape or form, and it directs 

the posterior distribution in a specific direction; usually, the prior is standard Gaussian 

(van den Oord et al., 2017).  

 
Figure 2. Variational Autoencoder depicted as a figure. Blue boxes represent the encoder, 
white is the latent space, and green boxes represent the decoder. 𝒙 is input, 
and 𝒙̂ represents output.

 
During the Variational Autoencoder's training, the model learns to refine the mappings of

the encoder and decoder by optimising the parameters that define these conditional

probability distributions 𝑞(𝑧|𝑥), the probability of 𝑧 given 𝑥, and 𝑝(𝑥|𝑧), the likelihood of

observing 𝑥 given 𝑧 (van den Oord et al., 2017). To enable effective training through stochastic

gradient descent, the latent variable 𝑧 is expressed as a deterministic function

𝑔_𝜙(𝜖, 𝑥), parameterised by 𝜙, of the input 𝑥 and an independent noise variable 𝜖 sampled from

a fixed distribution such as the standard normal (Kingma & Welling, 2013). This method, called the

reparameterization trick by Kingma and Welling (2013), allows for gradient flow during backpropagation.

This can also be visualised with the following code snippet of the reparameterization

function, where mu and log_var are the encoder outputs (the mean and log-variance of

𝑞(𝑧|𝑥)) and eps is the noise variable 𝜖.

 
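import numpy as np

def reparametrize(mu, log_var):
    std = np.exp(0.5 * log_var)  # standard deviation from the log-variance
    eps = np.random.standard_normal(std.shape)  # noise 𝜖 drawn from N(0, I)
    return mu + eps * std  # z = mu + eps * std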

 
VAEs have been applied in numerous fields, such as image processing, medical applications,

and language modelling (Wei et al., 2020). In image processing, VAEs are

considered to be state-of-the-art in image classification, image compression, and image

resolution (Bengesi et al., 2023). More recent advancements have been made in the 3D 

imaging domain, as VAE enables the efficient compression of high-dimensional spaces 

(Molnár & Tamás, 2024). 

 
In 2017, van den Oord et al. introduced a variant of VAEs called a Vector Quantized Var-

iational Autoencoder (VQ-VAE). Unlike traditional Variational Autoencoders, which have 

a continuous latent space, VQ-VAEs have a discrete latent space. This type of autoen-

coder uses a codebook in the quantization process, which makes the latent space dis-

crete (van den Oord et al., 2017). VQ-VAEs are discussed more in-depth in the related 

works chapter. 

 
2.3.3 Generative Adversarial Networks 

As previously mentioned, the Generative Adversarial Network was first developed by 

Goodfellow et al. (2014). The architecture of GAN comprises two neural networks that 

compete with each other. The first network is called a generator, denoted by G, which 

aims to generate convincing samples that can deceive the second network. The second 

network is known as the discriminative network, denoted by D, which aims to differentiate

between generated samples and real data (Goodfellow et al., 2014). The GAN architecture

is visualised in Figure 3.

 

Training the model’s two separate networks happens simultaneously (Goodfellow et al., 

2014). Convergence is considered to be reached when the discriminator network classi-

fication probability approaches 0.5, and the generator network’s probability distribution 

resembles the distribution of the real data (Goodfellow et al., 2014). The advantages 

listed by Goodfellow et al. (2014) are mainly computation improvements compared to 

previous models. A unique aspect of the model that sets it apart from other generative 

architectures discussed in this study is that the generator network is not directly exposed 

to embeddings of real data, which, according to Goodfellow et al. (2014), may lead 

to some statistical advantages. 

 
Figure 3. Generative Adversarial Network architecture depicted by Bengesi et al. (2023). 

 
Generative Adversarial Networks (GANs) have been extensively studied since, as evi-

denced by Jabbar et al.'s 2020 survey on Generative Adversarial Networks: Variants, Ap-

plications, and Training. Applications include image generation in the form of hand-



written font, image blending, texture synthesis, and 3D image synthesis (Jabbar et al., 

2020). Other notable implementations mentioned by Jabbar et al. (2020) are music gen-

eration, video synthesis, and applications in the medical field. 

 
2.3.4 Transformers 

Transformers were initially developed by Vaswani et al. (2017) in a study titled Attention 

Is All You Need. The transformer networks consist of encoder and decoder networks, 

each forming a stack of N identical layers. The fundamental unit of these layers is known 

as the Attention Head, which is responsible for updating each embedding based on its 

relationship with surrounding embeddings (Vaswani et al., 2017). In the paper, Vaswani

et al. (2017) describe running several of these attention heads in parallel to create Multi-Head Attention.

An overview of the Transformer model architecture is shown in Figure 5. 

 
Even though the Transformer model’s architecture is similar to the previously mentioned 

Variational Autoencoder architecture, which consists of encoder and decoder networks, 

there are some interesting differences worth examining a little further. In a VAE, the en-

coder network is only required during the training phase, and the inference happens by 

sampling latent space and using the decoder to decode the sampled embeddings 

(Kingma & Welling, 2013). However, in a Transformer network, as visible in Figure 5, the 

encoder is connected to the decoder so that input embeddings serve as context through-

out the autoregressive decoding process (Vaswani et al., 2017). 

 
The most interesting and maybe the most ground-breaking output from Vaswani et al.’s 

(2017) study was the attention mechanism, which they refer to as Scaled Dot-Product 

Attention. In the study, the form of attention is more broadly called self-attention, which 

means that the attention head computes the representation of each element in a sequence

by considering how it relates to every other element in the same sequence (Vaswani

et al., 2017). In contrast to Recurrent Neural Networks (RNN), which rely on sequential

processing, this attention method allows each position to attend independently



to all positions simultaneously, making the model more computationally efficient (Vas-

wani et al., 2017). 

 
Figure 4. Scaled Dot-Product Attention architecture (Vaswani et al., 2017).

 
Figure 4 visualises the architecture of Scaled Dot-Product Attention. To better under-

stand the process depicted in Figure 4, it is helpful to go through it step-by-step. Figure 

5 shows how input is transformed into input embeddings and enhanced with positional 

encoding. For each of the embeddings, the process depicted in Figure 4 is performed. 

The variable 𝑄 represents a query matrix that is the product of all the embeddings 𝐸_T multiplied

by 𝑊_Q, and similarly, 𝐾 represents a key matrix and equals 𝐸_T multiplied by 𝑊_K.

The matrix product of 𝑄 and the transpose of 𝐾 is then calculated (Vaswani et al., 2017).

 
According to Vaswani et al. (2017), the next step of scaling the result by 1/√𝑑_k is what

sets this method apart from regular Dot-product attention. In the study, Vaswani et al. 

(2017) talk about the theoretical complexity of additive attention and dot-product atten-

tion being similar and justify the use of dot-product due to it being faster because of 



optimised matrix multiplication calculations. However, with sufficiently large 𝑑_k, the additive

attention outperformed the unscaled dot-product attention, introducing the need for scaling

to prevent the vanishing gradient problem in the softmax layer (Vaswani et al., 2017).

 
The next step is a masking operation that is only done on the decoder side, as shown in 

Figure 5. During the autoregressive process, this prevents the decoder from attending to 

future data that otherwise would influence the present prediction (Vaswani et al., 2017). 

This is achieved by replacing the result of the matrix multiplication between 𝑄 and 𝐾

with negative infinity for all connections to positions after 𝐸_t, where 𝑡 is the index of

the embedding currently being processed (Vaswani et al., 2017). This adjustment ensures

that during the softmax calculation, which produces the weights in the form of a

probability distribution, the masked connections are assigned a weight of zero,

effectively disregarding their influence (Vaswani et al., 2017). The full context is maintained

for the encoder side, where the masking operation is skipped (Vaswani et al., 2017). 

 
The final step in the process illustrated in Figure 4 involves the matrix multiplication of 

the weight matrix with the value matrix 𝑉, which is the product of 𝐸_T and 𝑊_V. This operation

yields ∆𝐸_T, representing the direction in which the original embedding 𝐸_T

should be adjusted. In the Multi-Head Attention model, there are ℎ attention heads,

each generating a ∆𝐸_T. These matrices are concatenated to determine the unified direction

for adjusting the original embeddings, enhancing the model's capacity to integrate

various contextual insights (Vaswani et al., 2017). 
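The steps of Scaled Dot-Product Attention described above (matrix product, scaling, optional masking, softmax, and multiplication with the values) can be sketched in plain Python. This is a single-head illustration under simplifying assumptions: the projections with the weight matrices are omitted, so 𝑄, 𝐾, and 𝑉 are passed in directly, and the function names and toy inputs are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax; masked scores of -inf receive weight 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, causal=False):
    """Return softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    output = []
    for t, q in enumerate(Q):
        # Dot product of the query at position t with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Decoder-side mask: positions after t are set to negative infinity.
            scores = [s if j <= t else -math.inf for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return output


# Hypothetical toy input: three positions with two-dimensional vectors.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
masked = scaled_dot_product_attention(Q, K, V, causal=True)
print(masked[0])  # [1.0, 0.0]: the first position can only attend to itself
```

Without the mask, every row of weights covers all three positions, which corresponds to the encoder side where the full context is kept.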

 
The most famous transformer adaptation is probably Generative Pre-trained Trans-

former (GPT), which was released in 2018 by OpenAI (Bengesi et al., 2023). It is a large 

language model aiming to generate human-like text, and multiple versions have been 

released since (Bengesi et al., 2023). Other notable adaptations of the transformer architecture

are Bidirectional Encoder Representations from Transformers (BERT) and Vision

Transformer (ViT), which have produced state-of-the-art results in Natural Language Pro-

cessing (NLP) and in Computer Vision (CV) tasks, respectively (Chitty-Venkata et al., 2023). 



 
Figure 5. Transformer architecture (Vaswani et al., 2017). The figure consists of an encoder
on the left and a decoder on the right.

 

3 Related Work 

With the relatively recent increase in computational power and some breakthroughs in 

the adjacent field of image generation, AI-generated music has started to take bigger 

and bigger leaps forward. Some of the most notable papers in the context of this thesis 

will be discussed in detail below. Common to all these models, in the spirit of this thesis, 

is the use of raw audio data as input and output for the neural network, along with the 

incorporation of convolutions. To gain a deeper understanding of AI-generated music, it 

is essential to explore the key research problems that have shaped the field.  

 
Goel et al. (2022) identify three major challenges encountered by researchers when de-

signing architectures to model waveforms. The first challenge is maintaining global co-

herence in the modelled waveform, which requires the neural network to effectively 

capture long-range dependencies (Goel et al., 2022). This is supported by Dieleman et 

al. (2018) as they state that attaining globally coherent music has proven difficult. The 

second challenge discussed by Goel et al. (2022) is computational efficiency. High-fidelity 

audio involves orders of magnitude more input parameters than a simple image, as ex-

plained in the introduction chapter. For instance, the models in Dhariwal et al.'s (2020) 

Jukebox were trained for two to four weeks using up to 512 GPUs, a feat achievable only 

by the most well-funded research initiatives even by today's standards. The third chal-

lenge is sample efficiency, which is closely related to computational efficiency. According 

to Goel et al. (2022), sample efficiency refers to the model's ability to converge with 

fewer training samples, thereby enhancing overall computational efficiency. By improv-

ing sample efficiency, the model can achieve effective performance with less data, re-

ducing the computational resources and time required for training. 

 
3.1 WaveNet 

In 2016, Google DeepMind research laboratory released WaveNet, a deep neural net-

work model capable of generating music. As described in the 2016 paper by van den 

Oord et al., the model is autoregressive and probabilistic, meaning that each newly 



generated audio sample is conditioned on all previous samples. More concretely, the model calculates

a conditional probability distribution from which each new output is sampled.

The product of these conditional distributions forms the joint probability of waveform 𝑥,

as observed in Equation 1. 

 
$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}) \tag{1}$$

 
To achieve this functionality, van den Oord et al. (2016) depict a stack of causal convolu-

tion layers which model the conditional probability distribution.  

 
3.1.1 Causal and Dilated Convolutions 

One-dimensional causal convolutions, as described by van den Oord et al. (2016), are 

different from standard 1D convolutions. In causal convolutions, padding of size K-1, 

where K is the kernel size, is applied asymmetrically to the past or left side of the se-

quential input data. This practice prevents the model from accessing future information 

when predicting present data, as shown in Figure 6. 

 
Figure 6. Different types of convolutions. Blue dots depict the input layer, grey marks the 
hidden layers, and yellow is the output layer. 

 
WaveNet not only utilises causal convolutions but takes it one step further by imple-

menting dilated causal convolutions, where the dilation doubles between layers (van 



den Oord et al., 2016). This technique expands the model's receptive field, enabling it to 

capture longer temporal dependencies in audio sequences, which is vital for generating 

coherent and realistic sound over extended periods. Additionally, it is more computa-

tionally efficient and maintains the same input and output size, which is not the case 

with other convolution techniques like strided convolution that can be used for capturing 

longer temporal structures. 
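A causal dilated convolution and the resulting receptive field can be sketched as follows. The sketch generalises the left padding of K-1 to (K-1) times the dilation, which is what keeps the output the same length as the input; the function names, the kernel values, and the dilation schedule are assumptions made for illustration.

```python
def causal_dilated_conv1d(signal, kernel, dilation=1):
    """1D causal convolution: output[t] depends only on signal[t], signal[t - d], ..."""
    pad = (len(kernel) - 1) * dilation
    padded = [0.0] * pad + list(signal)  # zero padding on the left (past) side only
    output = []
    for t in range(len(signal)):
        # kernel[-1] multiplies the current sample, kernel[0] the oldest one.
        acc = 0.0
        for k, weight in enumerate(kernel):
            acc += weight * padded[t + k * dilation]
        output.append(acc)
    return output


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical impulse input: the response shows which past samples are used.
impulse = [1.0] + [0.0] * 7
out = causal_dilated_conv1d(impulse, kernel=[1.0, 1.0], dilation=2)
print(out)  # [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(receptive_field(2, [1, 2, 4, 8]))  # 16
```

Doubling the dilation in each of four layers with a kernel size of 2 already covers 16 input samples, whereas four undilated layers would cover only 5.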

 
3.1.2 Non-linear Quantisation 

Another notable technique in terms of music generation that van den Oord et al. (2016) 

discuss is performing a non-linear quantisation for the input signal. Their proposed 

method involves transforming the input signal using the μ-law companding technique 

visible in Equation 2, where −1 < 𝑥1 < 1 and 𝜇 = 255, and quantising the data into 256 

possible values. As a result, the data has a higher resolution on lower amplitudes due to 

its logarithmic nature (ITU-T, 1988). 

 
$$f(x_t) = \operatorname{sign}(x_t) \frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)} \tag{2}$$

 
To tie it together, van den Oord et al. (2016) utilise a softmax distribution to predict the 

likelihood of each of the 256 quantised values. Based on this, the most probable value 

can be selected. Alternatively, to increase variety, the distribution can be sampled.  The 

benefit of quantisation is clear: it simplifies the representation of audio samples, reducing

the data complexity from 16-bit to 8-bit per sample, thereby decreasing the softmax

distribution's output space from 65,536 to 256 values and lowering the computational cost.
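The companding and quantisation step can be sketched with the standard library alone. The encoder follows Equation 2 with 𝜇 = 255; the mapping of the compressed value to an integer class by rounding and the decoder are assumptions of this sketch, as are the function names.

```python
import math


def mu_law_encode(x, mu=255):
    """Compress x in [-1, 1] with Equation 2 and quantise it to an integer in 0..mu."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((compressed + 1.0) / 2.0 * mu))


def mu_law_decode(index, mu=255):
    """Approximate inverse: map an integer class back to an amplitude in [-1, 1]."""
    compressed = 2.0 * index / mu - 1.0
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


# The same amplitude step of 0.01 spans more classes near zero than near full scale.
low_step = mu_law_encode(0.02) - mu_law_encode(0.01)
high_step = mu_law_encode(0.91) - mu_law_encode(0.90)
print(low_step, high_step)
print(mu_law_decode(mu_law_encode(0.5)))  # close to 0.5, up to quantisation error
```

The comparison of `low_step` and `high_step` illustrates the higher resolution at low amplitudes that the logarithmic mapping provides.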

 
In contrast to previous studies, van den Oord et al. (2016) decided to use a gated activa-

tion function instead of a rectified linear activation function as it produced better results 

when modelling audio signals. Their gated activation unit consists of two adjacent con-

volutional layers, filter and gate, with respective tanh and sigmoid activation functions. 



In addition, the units contain residual and skip connections, which speed up convergence

and allow the use of deeper networks.

 
3.2 Jukebox 

The next big leap in AI-generated music came in 2020 when OpenAI researchers Dhariwal

et al. released the Jukebox model. The research listed its achievements as generating

multiple-minute songs that stay coherent and contain singing. The actual model consists 

of three layered VQ-VAEs, Vector Quantized Variational Autoencoders, which learn to 

encode the music into embeddings. For inference, Dhariwal et al. (2020) used auto-

regressive Scalable Transformers. When generating music with the model, it is possible 

to prime it with text or audio data. 

 
3.2.1 Multi-scale VQ-VAE 

The Jukebox model comprises three VQ-VAEs operating on different temporal scales. 

Each VQ-VAE is trained separately to prevent the model from relying solely on the one 

that learns the highest temporal resolution (Dhariwal et al., 2020). Each VQ-VAE utilises 

WaveNet-like 1D convolutions mirrored for the encoder and decoder, intending to in-

crease the model's receptive field (Dhariwal et al., 2020). 

 
In Jukebox, Dhariwal et al. (2020) describe the quantisation process of a one-dimensional

VQ-VAE, where an input signal 𝑥 = {𝑥_1, …, 𝑥_T} is encoded into a sequence of indices

𝑧 = {𝑧_1, …, 𝑧_S}, where 𝑇 is the input length, 𝑆 is the count of indices, and 𝑇/𝑆 is the hop length,

indicating the level of dimension reduction. During the encoding process, 𝑥 is transformed

into latent vectors ℎ = {ℎ_1, …, ℎ_S}, and each latent vector ℎ_s is mapped to the

nearest codebook vector 𝑒_{z_s} ∈ 𝐶 = {𝑒_1, …, 𝑒_K}, where 𝐾 is the codebook size. This results

in the discretisation of the latent space.  
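The nearest-neighbour lookup that discretises the latent space can be sketched as follows. The codebook values, the latent vectors, and the function name `quantise` are hypothetical; the sketch uses the squared Euclidean distance as the notion of "nearest", which is an assumption of the example.

```python
def quantise(latents, codebook):
    """Map each latent vector h_s to the index z_s of its nearest codebook vector."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(h, codebook[k]))
        for h in latents
    ]
    quantised = [codebook[z] for z in indices]
    return indices, quantised


# Hypothetical codebook with K = 3 two-dimensional vectors and S = 3 latent vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
indices, quantised = quantise(latents, codebook)
print(indices)  # [0, 1, 2]
```

Only the integer indices need to be stored or predicted afterwards, since the codebook maps them back to vectors for the decoder.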

 

For the training, Dhariwal et al. (2020) use a total loss that is the sum of three separate loss

functions. The first one is a straightforward reconstruction loss calculated between 𝑥 and the decoded

codebook vector sequence, as described in Equation 3.

 
$$\mathcal{L}_{\text{reconstruction}} = \frac{1}{T} \sum_t \lVert x_t - \hat{x}_t \rVert_2^2 \tag{3}$$

 
The other two loss functions are called codebook loss and commit loss. Both utilise the 

stop gradient (sg) function, which essentially means that in backpropagation, the vector inside sg

is locked in place as its gradient is set to zero. The codebook loss penalises the model

when the codebook vector 𝑒_{z_s} is far from the encoded vector ℎ_s, and because ℎ_s is held fixed,

it moves the codebook vectors towards the encoder outputs. The commit loss works the other way

around: the model is penalised if the encoded vector ℎ_s is far from the codebook vector 𝑒_{z_s},

which is held fixed. The respective loss functions are detailed in Equations 4 and 5

(Dhariwal et al., 2020).

 
$$\mathcal{L}_{\text{codebook}} = \frac{1}{S} \sum_s \lVert \mathrm{sg}[h_s] - e_{z_s} \rVert_2^2 \tag{4}$$

 
$$\mathcal{L}_{\text{commit}} = \frac{1}{S} \sum_s \lVert h_s - \mathrm{sg}[e_{z_s}] \rVert_2^2 \tag{5}$$

 
In combination, these loss functions produce the total loss function for the model detailed

in Equation 6. Dhariwal et al. (2020) state that ℒ_commit is added to constrain

the values of ℎ_s closer to the possible 𝑒_{z_s} values, and it is said to have a

stabilising effect. The weight of the commit loss can be controlled with the 𝛽 value.

 
$$\mathcal{L} = \mathcal{L}_{\text{reconstruction}} + \mathcal{L}_{\text{codebook}} + \beta \mathcal{L}_{\text{commit}} \tag{6}$$
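The forward value of the combined loss in Equation 6 can be sketched in plain Python. The stop-gradient operator only decides which term receives gradients during backpropagation, so this sketch, which computes values without gradients, evaluates the codebook and commit terms identically. The value of `beta`, the toy inputs, and the function names are hypothetical.

```python
def mean_squared_norm(a_vectors, b_vectors):
    """(1/N) * sum_n ||a_n - b_n||_2^2 over two equally long lists of vectors."""
    total = sum(
        sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        for a, b in zip(a_vectors, b_vectors)
    )
    return total / len(a_vectors)


def vq_vae_loss(x, x_hat, latents, quantised, beta=0.02):
    """Forward value of Equation 6; beta is a hypothetical weight."""
    # Scalar audio samples are wrapped as one-dimensional vectors.
    reconstruction = mean_squared_norm([[v] for v in x], [[v] for v in x_hat])
    # sg[.] changes only which side receives gradients, so in a forward pass
    # the codebook and commit terms evaluate to the same number.
    codebook = mean_squared_norm(latents, quantised)
    commit = mean_squared_norm(latents, quantised)
    return reconstruction + codebook + beta * commit


# Hypothetical toy values: two audio samples and one latent vector.
total = vq_vae_loss([0.0, 1.0], [0.0, 0.5], [[0.0, 0.0]], [[1.0, 0.0]])
print(total)
```

In an automatic differentiation framework the two terms would differ in their gradients: the codebook term updates only the codebook vectors and the commit term updates only the encoder.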

 
In the paper, Dhariwal et al. (2020) discuss a problem in which the model only learns to 

reconstruct low frequencies when using sample-level reconstruction loss. The proposed 

solution uses the Short-Time Fourier Transform (STFT), which helps the model to learn 



mid-to-high frequencies. The resulting loss is called spectral loss and is defined in Equa-

tion 7. 

 
$$\mathcal{L}_{\text{spectral}} = \lVert \mathrm{STFT}(x_t) - \mathrm{STFT}(\hat{x}_t) \rVert_2 \tag{7}$$
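A spectral loss of this kind can be sketched with a naive STFT built from the standard library. The frame length, hop length, Hann window, and the comparison of magnitude spectra are assumptions of this sketch, and the direct DFT is used only for readability; a practical implementation would rely on an FFT routine.

```python
import cmath
import math


def stft_magnitudes(signal, frame_length=8, hop_length=4):
    """Magnitudes of a naive short-time Fourier transform with a Hann window."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_length)
        for n in range(frame_length)
    ]
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop_length):
        frame = [signal[start + n] * window[n] for n in range(frame_length)]
        spectrum = [
            abs(sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                for n in range(frame_length)
            ))
            for k in range(frame_length // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


def spectral_loss(x, x_hat):
    """L2 norm of the difference between the two magnitude spectrograms."""
    a = stft_magnitudes(x)
    b = stft_magnitudes(x_hat)
    return math.sqrt(
        sum((p - q) ** 2 for fa, fb in zip(a, b) for p, q in zip(fa, fb))
    )


# Hypothetical signals: a high-frequency tone and a silent reconstruction.
tone = [math.sin(2 * math.pi * 0.25 * n) for n in range(32)]
silence = [0.0] * 32
print(spectral_loss(tone, silence) > spectral_loss(tone, tone))  # True
```

A reconstruction that drops the tone entirely is penalised in the frequency domain, which is the behaviour the spectral loss is meant to add on top of the sample-level loss.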

 
VQ-VAEs have some known problems, one of which is codebook collapse. This means 

that most of the latent vectors ℎ_s get mapped to only a few of the codebook vectors 𝑒_{z_s}. To mitigate this problem, Dhariwal et al.

(2020) introduce random restarts, which means that if the average usage of a codebook

vector falls too low, that vector is replaced by a randomly chosen ℎ_s vector.
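The random restart idea can be sketched as follows. How usage is tracked and which threshold counts as too low are not specified here, so the `usage` list, the `threshold` value, and the function name `random_restart` are hypothetical.

```python
import random


def random_restart(codebook, usage, latents, threshold=1.0, seed=0):
    """Replace rarely used codebook vectors with randomly chosen encoder outputs."""
    rng = random.Random(seed)
    for k, count in enumerate(usage):
        if count < threshold:
            codebook[k] = list(rng.choice(latents))  # copy, so later edits stay separate
    return codebook


# Hypothetical state: the second codebook vector is never selected.
codebook = [[0.0, 0.0], [5.0, 5.0]]
usage = [10.0, 0.0]
latents = [[1.0, 1.0], [2.0, 2.0]]
updated = random_restart(codebook, usage, latents)
print(updated)
```

After the restart, the unused vector sits on top of an actual encoder output, so it has a chance of being selected again.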

 
3.2.2 Scalable Transformers and Upsampling 

In addition to the VQ-VAE neural networks, Jukebox consists of autoregressive trans-

formers for the actual music inference. In the paper, these transformers are divided into 

a top-level prior and upsamplers called middle and bottom. The prior 𝑝(𝑧) is a joint con-

ditional probability distribution, with each component similar to Equation 1. The com-

plete distribution is defined in Equation 8 (Dhariwal et al., 2020). 

 
$$p(z) = p(z_{\text{top}}) \, p(z_{\text{middle}} \mid z_{\text{top}}) \, p(z_{\text{bottom}} \mid z_{\text{middle}}, z_{\text{top}}) \tag{8}$$

 
The approach chosen by Dhariwal et al. (2020) simplifies the prediction task, as it occurs 

in a discrete space and can be categorised as a classification problem instead of a regres-

sion problem. Basically, the transformers predict codebook indices autoregressively. 

 
In the model, inference with transformers happens top-down, and each prediction layer 

consists of the same number of discrete codes that map to shorter and shorter segments 

in the raw audio domain. Upsamplers are conditioned only with the previous layer’s dis-

crete codes that match the raw audio length of the current layer. These codes go through 

a conditioning layer that uses dilated convolutions, similar to WaveNet, leading to an 

increase in the number of discrete codes after each layer, recreating the lost information 

and resulting in increased audio resolution (Dhariwal et al., 2020). 



 
To decrease entropy, Dhariwal et al. (2020) encode artist and genre to embedding vec-

tors. This can also be used to direct the model during the generation phase. They also 

provide the model with the total length of the audio signal and the start and end times 

of the current segment. 

 
3.3 Summary 

Both studies claimed state-of-the-art performance and results at the time of release. 

Even if this is true, it doesn't necessarily provide a complete understanding of the mod-

els' actual capabilities. Therefore, it is valuable to analyze both models in light of the 

challenges highlighted by Goel et al. (2022). One general concern with

this line of study is the difficulty of directing the generation in the desired direction.

Compared to models that use prompt-based sequence-to-sequence generation,

like MusicLM (Agostinelli et al., 2023), it is harder to steer the output,

especially in novel directions for which no existing audio is available

to be used as a prior. It is also important to note that although WaveNet

possesses music generation capabilities, which are discussed in the study, the

main focus of the study was on text-to-speech generation.

 
The first challenge brought up by Goel et al. (2022) is the requirement for globally coherent

generation. Jukebox’s approach clearly enables it to obtain some longer-range structure 

and maintain it throughout the generated sample, but at the cost of the audio's fidelity. 

This observation is aligned with previous research, as Dieleman et al. (2018) noted that 

using hierarchical autoregressive inference led to improved long-range structure and decreased

signal quality, indicating a trade-off. Similarly, WaveNet's application to music

generation demonstrated the importance of a large receptive field to produce samples 

that sounded musical (van den Oord et al., 2016). Despite this, the models struggled with 

long-range consistency, resulting in second-to-second variations in genre, instrumenta-

tion, and volume (van den Oord et al., 2016). However, the generated samples were of-

ten harmonic and aesthetically pleasing, particularly when conditional models were 



used to control specific aspects of the output based on tags like genre or instruments 

(van den Oord et al., 2016). 

 
An assessment of Jukebox against the second point raised by Goel et al. (2022)

reveals that computational efficiency is still a problem when training neural networks to 

model audio waveforms. In their paper, Dhariwal et al. (2020) mention four different 

training events: the two upsampler networks were trained with 128 GPUs for 2 weeks, 

top-level prior training took 4 weeks with 512 GPUs, and lastly, the lyrics conditioning

training was performed with 128 GPUs for a total of 2 weeks. These numbers highlight

significant monetary and time constraints for conducting research on this subject. The 

cost estimate for the GPUs alone amounts to millions of dollars, approximately 6.4 mil-

lion USD, and does not include other necessary hardware expenses for building a system 

capable of training these networks. 

 
The computational demands of using the original WaveNet model are significant due to 

its autoregressive nature, which requires generating audio one sample at a time (van 

den Oord et al., 2016). While the original WaveNet paper by van den Oord et al. (2016) 

does not discuss the hardware or training durations in detail, additional insights can be 

obtained from the 2017 paper "Parallel WaveNet: Fast High-Fidelity Speech Synthesis" 

by van den Oord et al. (2017). In this paper, the authors state that the re-engineered 

WaveNet's inference was 1,000 times faster than the original, with the ability to generate 

one second of audio in just 50 milliseconds (van den Oord et al., 2017). This implies that 

the original model took approximately 50 seconds to generate one second of audio. 

Given this generation speed and the follow-up research conducted to improve it,

it can be deduced that even though the training

of the original WaveNet was efficient, its slow sample-by-sample inference made it computationally

inefficient in use, so it did not overcome the second challenge defined by Goel et al. (2022).

 
The last challenge mentioned by Goel et al. (2022) is sample efficiency, closely inter-

twined with computational efficiency as both affect the model’s performance. Sample 



efficiency aims to achieve better performance through inductive biases (Goel et al., 

2022). This means that the model’s learning can be enhanced by making the right design 

and architectural choices (Hüllermeier et al., 2013, p. 1018). In the case of WaveNet, an 

example is its autoregressive inference, which assumes that each new sample depends 

on the previous ones (van den Oord et al., 2016). For humans, it is intuitive that in music, 

each note depends on the previous ones, but for machines, this connection can be chal-

lenging to learn without the aid of inductive bias. Similarly, Jukebox employs a hierar-

chical VQ-VAE architecture, which introduces inductive biases by modelling music at 

multiple levels of abstraction. This approach allows Jukebox to efficiently capture both 

the long-term structure and the fine details of music, enhancing its sample efficiency 

(Dhariwal et al., 2020). 



4 Methodology 

In this thesis, the chosen methodological approach is a controlled experiment. As the 

term implies, the objective is to establish a controlled environment where the experiment 

can be conducted (Järvinen, 2018; Walliman, 2010, p. 11). The fundamental purpose 

of experimental research is to establish causality: by carefully controlling and manipulating 

variables, researchers can isolate specific factors and observe their direct impact 

on outcomes (Järvinen, 2018; Walliman, 2010, p. 103). This approach is used to test 

hypotheses, validate theories, and contribute to the body of knowledge in a systematic 

and replicable manner that allows making informed decisions based on empirical evi-

dence. 

 
In On Research Methods, Järvinen (2018) identifies two critical requirements for new 

knowledge: the necessity for rich and applicable knowledge and the need 

for reliable knowledge. These requirements often conflict. Specifically, when designing 

research, imposing numerous constraints to enhance reliability can strip the experi-

mental context of factors that connect it to real-world conditions (Järvinen, 2018). Con-

sequently, achieving a balance between these two factors is crucial to obtaining valuable 

insights from an experiment (Järvinen, 2018). 

 
The basic terminology of controlled experiments includes dependent variables, inde-

pendent variables, and intervening variables (Järvinen, 2018). The dependent variable(s), 

as defined by Järvinen (2018), represent the quantitative outcomes of the study, meas-

uring aspects that indicate improvement or deterioration, a definition corroborated by 

Walliman (2010, p. 11). The independent variable(s) are the parameters controlled by 

the researcher, with the premise that manipulating these independent variables should 

result in a measurable change in the dependent variable(s) attributable to the independ-

ent variable(s) (Järvinen, 2018; Walliman, 2010, p. 11). The intervening variable(s) are 

those that are not under the control of the researcher but still affect the dependent var-

iable and hence cannot be excluded from the research (Järvinen, 2018). The variables 



that do not fall into these categories are commonly referred to as unknown variables, 

which are assumed not to have an impact on the research (Järvinen, 2018). 

 
Experimental research can be enhanced with a control experiment in addition to the 

main experiment to increase the certainty that these unknown variables do not affect 

the result (Järvinen, 2018). This is done to confirm that the result truly was 

caused by the manipulation of independent variables and not by unknown variables 

or other factors (Järvinen, 2018). Two additional criteria, listed by Järvinen (2018), 

that can be evaluated to gain more proof of the causality are association or relationship 

and temporal precedence. The association or relationship evaluation relies on proving 

the existence of covariance between independent and dependent variables, i.e., change 

in the independent variable shows reliable and observable change in the dependent var-

iable (Järvinen, 2018). Proving the temporal precedence of events means that in order 

for two things to be causally linked, the change in the dependent variable must always 

happen after the change in the independent variable (Järvinen, 2018). 

 
The practical implementation of this research involved conducting a controlled experi-

ment to test whether a convolutional architecture could improve an autoencoder’s abil-

ity to learn audio data representation. The process began with building a baseline 

model—a fully connected autoencoder with sufficient performance for comparison. Fol-

lowing this, a convolutional autoencoder was created to evaluate the effects of convolu-

tional layers on the model’s performance. Both models were trained from zero until con-

vergence, and the results were analyzed to assess differences in performance, model 

size, and output quality. 



5 Experiment Design and Model Evolution 

This research is focused on testing a hypothesis through a controlled experiment. It in-

volves comparing two models to determine if convolution can improve the model's abil-

ity to learn audio data representation. The first step was to create a prototype model 

with a decent baseline performance, against which the convolution model could be com-

pared. The next step was to create a convolution model and, thirdly, analyse the differ-

ences in performance, size, and quality of the result.  

 
The initial idea was to study the effects of multidimensional convolution on AI-generated 

music. The prototyping phase turned out to be the most time-consuming part of the 

thesis work. Due to time constraints, the research scope had to be adjusted as the pro-

totyping phase continued. The experiment also faced constraints imposed by the system 

used for testing, which led to the decision to omit the generative functionality from the 

neural network due to these limitations.  As previously described, the reduced scope 

focuses on the performance differences between a fully connected Autoencoder net-

work and a 1D convolution Autoencoder network. 

 
The initial plan was to acquire a raw audio dataset and do minimal preprocessing before 

feeding that raw audio data into a generative neural network. After the initial model was 

ready, the goal was to experiment with the input data’s dimensionality and convolutions to see 

if that would improve the consistency of the structure in the long term. In the planning 

phase, a Variational Autoencoder was selected as the generative neural network type. 

The initial model was built, but during the training, the validation metrics showed that 

the model could not learn a meaningful representation of the data. This sparked the 

prototyping process that followed a continuous feedback loop with three key phases. 

First, the model underwent training and testing. Next, the results were analysed. Finally, 

based on these insights, the model and/or preprocessing techniques were refined and 

updated, leading back to the next iteration of training and testing. Numerous prepro-

cessing methods were tried during the prototyping process, such as μ-law companding 

and quantisation, as explained by van den Oord et al. in 2016. Throughout the process, 



the original model structure was adjusted to account for changes in the dimensionality 

of the input data. Initially, the 1D sequential audio data was converted into a 2D image-

like format. Subsequently, it was further transformed into a 3D video-like format during 

the following prototyping iterations. As previously mentioned, due to the inability to 

construct a model capable of learning meaningful data representations within the con-

straints, the decision was made to use an Autoencoder neural network instead of a Var-

iational Autoencoder network. 

 
5.1 Dataset and System  

This research was performed on Faraldo’s (2017) Beatport EDM Key Dataset, which con-

sists of 1486 songs in the Electronic Dance Music (EDM) genre. The data was divided into 

training, validation, and test sets, so the training set consisted of 1300 songs, the valida-

tion set had 130 songs, and the test set included 20 songs. These songs were split into 

segments. Different segment lengths were tested, but a 3-second segment length was 

selected. The audio data has a sampling rate of 44100 Hz, meaning there are 44.1 thousand 

data points per second. To make this manageable, the audio was downsampled by a 

factor of ten, resulting in 4410 data points per second. Combined with the selected 3-second 

segment length, this results in an input length of 13230 in the neural network. 
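
The arithmetic above can be illustrated with a minimal sketch. It is not the thesis pipeline: the helper name downsample_and_segment is hypothetical, and the naive stride-based decimation is an assumption, since the text does not state which resampling method was used (a production pipeline would normally apply a low-pass filter first to limit aliasing).

```python
import numpy as np

SAMPLE_RATE = 44100       # Hz, original sampling rate of the dataset
DOWNSAMPLE_FACTOR = 10    # keep every 10th sample -> 4410 Hz
SEGMENT_SECONDS = 3       # segment length selected in the study


def downsample_and_segment(audio):
    """Hypothetical helper: decimate a mono signal and cut it into 3 s segments."""
    # Naive decimation (assumption): no anti-aliasing filter is applied here.
    reduced = np.asarray(audio)[::DOWNSAMPLE_FACTOR]
    segment_len = (SAMPLE_RATE // DOWNSAMPLE_FACTOR) * SEGMENT_SECONDS  # 13230
    n_segments = len(reduced) // segment_len
    # Drop the incomplete tail so every segment has the same length.
    trimmed = reduced[: n_segments * segment_len]
    return trimmed.reshape(n_segments, segment_len)


# Example: ten seconds of synthetic audio yields three full segments.
segments = downsample_and_segment(np.random.uniform(-1, 1, SAMPLE_RATE * 10))
print(segments.shape)  # (3, 13230)
```

Each row of the returned array has the 13230 values that define the network's input length.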

 
As previously referenced, the importance of computational resources cannot be over-

emphasised. The models' prototyping and actual experiments were performed on a PC. 

The PC had an Intel i7-4770K processor and 24 GB of DDR3 RAM. Neural network training 

was performed on a dedicated GPU, Nvidia GeForce GTX 1080, which has 8 GB of dedi-

cated VRAM and an additional 12 GB of shared GPU memory, for a total of 20 GB of 

memory. The bottleneck in the process turned out to be the GPU memory. The main 

limitation of GPU memory is its impact on the size of the neural network it can handle. 

In this context, size is closely related to complexity, as increasing the number of layers or 

the size of each layer in the neural network increases its memory usage. When proto-

typing, I found that a model with three fully connected layers containing 9800, 6500, and 

3300 neurons resulted in a size of over 10 GB when loaded into the GPU memory. 



 
5.2 Model Evolution 

This research resulted in the development of two neural networks, each designed to 

capture specific representations of audio data. This section also outlines the preprocessing 

steps necessary for the success of these neural networks. As discussed earlier, the model 

development process followed a prototyping approach, which created a feedback loop. 

This iterative process led to several unsuccessful combinations of preprocessing tech-

niques and generative neural network models. 

 
As the project timeline became more constrained, a strategic decision was made to nar-

row the scope by excluding the generative neural network. Following this decision, many 

previously developed preprocessing methods were reevaluated using a standard Auto-

encoder instead of a Variational Autoencoder. However, the performance outcomes 

were unsatisfactory, suggesting that the primary limiting factor in this study may have 

been the system's computational capacity in relation to network size. 

 
5.2.1 Preprocessing 

The preprocessing in this study can be divided into two categories: non-transformative 

and transformative methods. Non-transformative methods modify the data’s dimen-

sions without altering its internal structure. Examples of such preprocessing steps in-

clude data segmentation and sampling, which were applied consistently throughout the 

prototyping process. In the sampling step, the original 44.1 kHz signal was downsampled 

to 4.41 kHz, reducing the number of data points by a factor of 10. Similarly, in the data 

segmentation step, the downsampled 2-minute-long songs were divided into more man-

ageable 3-second chunks. These methods were introduced primarily to reduce the com-

putational load on the system, as processing large datasets in their original form would 

have been resource-intensive. Other preprocessing methods that fall into the first cate-

gory and align with the study’s initial plan were modifications to the input data dimen-

sions. 



In contrast, transformative methods involve altering the data’s internal relationships or 

converting it from one coordinate system to another. Initially, the plan did not include 

any data-transforming preprocessing, as the goal was to work with the raw data as much 

as possible. However, when it became evident that the neural networks could not effec-

tively learn from the raw audio data, transformative preprocessing steps were added to 

the workflow. These transformations aimed to modify the data to better align with the 

neural network’s learning capabilities. Various transformative preprocessing techniques 

were tested individually and in combination as part of the iterative prototyping process. 

 
Early attempts at transformative preprocessing involved data normalisation, which is 

beneficial in some instances, as pointed out by Singh and Singh (2019). Min-max normal-

isation was selected, with a range of [-1, 1], to align with the sigmoid activation function's 

output range used by the model at that time. Despite this alignment, no noticeable im-

provement was observed in the behaviour of the neural networks. 
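
As an illustration of the min-max step, the following sketch rescales a signal linearly to [-1, 1]. The helper name min_max_normalise and the handling of a constant signal are assumptions made for this example only; the thesis does not show its normalisation code.

```python
import numpy as np


def min_max_normalise(x, low=-1.0, high=1.0):
    """Hypothetical helper: linearly rescale x so its values span [low, high]."""
    x = np.asarray(x, dtype=np.float64)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # A constant signal has no range to scale; map it to the midpoint.
        return np.full_like(x, (low + high) / 2.0)
    return low + (x - x_min) * (high - low) / (x_max - x_min)


scaled = min_max_normalise([0.0, 5.0, 10.0])
print(scaled)  # values -1, 0 and 1
```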

 
import numpy as np


def mu_law_companded(x, mu=255):
    # Ensure the input is in the range [-1, 1]
    x = np.clip(x, -1, 1)

    # Apply µ-law companding
    x_mu = np.sign(x) * (np.log1p(mu * np.abs(x)) / np.log1p(mu))

    return x_mu

 
Subsequent preprocessing attempts utilised μ-law companding, as described by van den 

Oord et al. (2016). The code snippet above demonstrates how the transformation was 

applied, where the input data was first clipped to the range of [-1, 1] before using the μ-

law companding transformation. Clipping was necessary for this process, as indicated in 

Equation 2. When examining Waveform 1 in Figure 7, it is clear that most amplitudes 

remained within the range of [-1, 1], making clipping an effective solution. Listening tests 

comparing Waveform 1 and Waveform 3 revealed no significant auditory differences, 

even though clipping resulted in minimal data loss. This indicates that clipping did not 

significantly affect the quality of the processed audio. 



 
def mu_law_decoding(y, mu=255):
    # Apply inverse µ-law decoding (assumes numpy is imported as np)
    x = np.sign(y) * (1.0 / mu) * (np.power(1 + mu, np.abs(y)) - 1)

    return x

 
The above code snippet depicts how the μ-law companding transformation was reversed, 

and from Figure 7, the effects of μ-law companding on a waveform can be observed. The 

first plot shows Waveform 1, the validation waveform, without any modifications. The 

second plot displays Waveform 1 after applying μ-law companding using the 

mu_law_companded function. The third plot exhibits Waveform 2 after reversing the 

μ-law companding using the mu_law_decoding function. However, feeding the 

neural network with μ-law companded data did not lead to any performance improvements. 
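
The relationship between the two functions can be checked with a short round trip. The sketch below restates both functions so that it runs on its own; the test signal is synthetic and deliberately includes values outside [-1, 1] to show where the clipping step loses information.

```python
import numpy as np


def mu_law_companded(x, mu=255):
    x = np.clip(x, -1, 1)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)


def mu_law_decoding(y, mu=255):
    return np.sign(y) * (1.0 / mu) * (np.power(1 + mu, np.abs(y)) - 1)


# Synthetic test signal (assumption): two samples exceed [-1, 1] on purpose.
signal = np.array([-1.2, -0.5, -0.01, 0.0, 0.01, 0.5, 1.2])
restored = mu_law_decoding(mu_law_companded(signal))

# In-range samples are recovered up to floating-point error ...
in_range = np.abs(signal) <= 1
print(np.allclose(restored[in_range], signal[in_range]))  # True
# ... while out-of-range samples come back clipped to -1 or 1.
print(restored[~in_range])
```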



 
Figure 7. Effects of μ-law companding on a waveform. 

 
Van den Oord et al. (2016) utilised μ-law companding along with quantisation, prompt-

ing the consideration of incorporating this technique in the preprocessing phase. This 

method involves transforming the problem from a regression to a classification problem. 



However, as this was not the chosen preprocessing method for this study, even though 

providing a detailed explanation would be interesting, it would also divert focus from the 

current study. Therefore, it is only provided here for context. 

 
The preprocessing method that ultimately proved effective involved a combination of 

steps. First, non-transformative techniques such as downsampling and segmentation 

were applied. Each 3-second segment was then transformed using a Real Fast Fourier 

Transform (RFFT), converting the data into the frequency domain. After 

the RFFT transformation, the length of each 3-second segment was halved, with the resulting 

array containing complex values with a real and an imaginary part. In the final 

step, the array size was doubled back to its original length by splitting each value into 

separate real and imaginary coefficient components. The final array alternated between 

real and imaginary coefficients, structured as [real_value, img_coeff, real_value, img_coeff, 

…], making the data suitable for input into the neural network. It is interesting to note 

that the length of the segment, when combined with the RFFT, impacts the neural net-

works' capacity to learn. When the process of segmenting and RFFT was reversed so that 

the complete 2-minute song was taken through the RFFT and only then segmented into 

“3-second” segments, it led to the network's inability to capture a meaningful represen-

tation of the data. 
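
The RFFT-and-interleave step can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: the function names are hypothetical, and because np.fft.rfft returns n/2 + 1 complex bins for an input of even length n, the sketch drops the last bin so that the interleaved array has exactly 13230 values. How the thesis handled this extra bin is not stated.

```python
import numpy as np

SEGMENT_LEN = 13230  # 3 s at 4410 Hz


def rfft_interleave(segment):
    """Hypothetical preprocessing: real FFT, then interleave real and imaginary parts."""
    spectrum = np.fft.rfft(segment)           # len(segment) // 2 + 1 complex bins
    spectrum = spectrum[: len(segment) // 2]  # assumption: drop the last bin
    features = np.empty(2 * len(spectrum), dtype=np.float64)
    features[0::2] = spectrum.real  # even positions: real coefficients
    features[1::2] = spectrum.imag  # odd positions: imaginary coefficients
    return features


def rfft_deinterleave(features, n=SEGMENT_LEN):
    """Hypothetical inverse: rebuild the complex bins and return to the time domain."""
    spectrum = features[0::2] + 1j * features[1::2]
    # The dropped bin is restored as zero, consistent with the assumption above.
    spectrum = np.append(spectrum, 0.0)
    return np.fft.irfft(spectrum, n=n)


t = np.arange(SEGMENT_LEN) / 4410.0
segment = np.sin(2 * np.pi * 220.0 * t)  # synthetic 220 Hz tone, not thesis data
features = rfft_interleave(segment)
print(features.shape)  # (13230,)
print(np.allclose(rfft_deinterleave(features), segment))
```

The inverse function is what a reconstruction would pass through before it can be listened to or plotted as a waveform.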

 
5.2.2 Model development 

This study produced two distinct Autoencoder neural network architectures: the base 

model and the convolutional model. The base model was created as a reference point 

for comparing the performance of the convolutional model. 

 
The base model consists of 5 fully connected layers, similar to the model shown in Figure 

4. It comprises an encoder side (depicted in blue in Figure 4) and a decoder side (de-

picted in green in Figure 4). In the middle, there is a bottleneck layer that determines 

the minimum dimensionality through which the data is passed. This layer can signifi-

cantly impact the model's ability to learn a representation of the data, as having too few 



nodes can make it impossible to capture all the important attributes of the data. In a 

network layer, there are input and output nodes, which are similar to the two bottom-

most layers shown in Figure 1. The connections are as depicted in Figure 1, where each 

node is connected to every node in the next layer; hence the name fully connected. The 

preprocessing creates data blocks the size of 13230 attributes; this determines the first 

layer input size. The two layers in the encoder reduce the attributes first to 9800 and 

then 6500. The bottleneck layer, also known as latent space/vector, further reduces the 

dimensionality to 3300 attributes. The decoder mirrors the encoder and upscales the 

data back to 13230 in 2 layers. This model uses a Rectified Linear Unit activation function 

(ReLU) where each layer’s output is pushed through the activation function except for 

the decoder’s final layer. 

 
The convolutional model, as described by Goodfellow et al. (2016, p. 326), is a type of 

neural network where one or more layers are replaced with convolutional layers. In this 

research, the model consists of 8 layers, with six being convolutional and two being lin-

ear (fully connected). The model design is shown in Figure 8 and Figure 9. Figure 8 illus-

trates the encoder side of the Autoencoder design, with each block representing the 

interface between layers. The numbers above the blocks represent the channels in each 

interface. For example, the first layer takes data in one channel, but after the first con-

volution, the data is channelled into 16 separate channels. This increase in the number 

of channels is represented by the increased thickness of the blocks in the figures. Addi-

tionally, each convolution reduces the input length for the next layer, which is visualised 

by the decrease in the "area" of each block.  



 
Figure 8. The encoder part of the Convolutional Autoencoder is in blue, and the first 
latent vector is in white. 

 
The code snippet below illustrates how the input length decreases as it passes through 

the encoder side. In the code, 40 represents the batch size, while 16, 32, and 64 repre-

sent the channel counts, and the last number represents the input length. An interesting 

observation from the code is that each layer doubles the channel count while halving the input length. The 

halving is a result of the kernel attribute selection for the convolutions. Each convolution layer 

uses the same kernel attributes, which are size: 4, stride: 2, and padding: 1. The "flatten" 

operation is essential for transforming the 2D data back to a 1D form, allowing it to be 

fed to the linear bottleneck layer. 

 
Encoder input: torch.Size([40, 1, 13230]) 
Conv1: torch.Size([40, 16, 6615]) 
Conv2: torch.Size([40, 32, 3307]) 
Conv3: torch.Size([40, 64, 1653]) 
Flatten: torch.Size([40, 105792]) 
Bottleneck: torch.Size([40, 3300]) 
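
The printed lengths can be reproduced with the standard output-length formula for a 1D convolution without dilation, floor((L + 2 * padding - kernel) / stride) + 1. The sketch below is a plain-Python check of that arithmetic; the helper name conv1d_out_len is hypothetical.

```python
def conv1d_out_len(length, kernel=4, stride=2, padding=1):
    """Output length of a 1D convolution (dilation 1), per the standard formula."""
    return (length + 2 * padding - kernel) // stride + 1


length, channels = 13230, 1
for out_channels in (16, 32, 64):
    length, channels = conv1d_out_len(length), out_channels
    print(channels, length)   # 16 6615, then 32 3307, then 64 1653

print(channels * length)      # 105792, the flattened size fed to the bottleneck
```

The odd intermediate lengths (6615 and 3307) are rounded down by the floor division, which is why mirroring the layers in the decoder does not restore the lengths automatically.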

 
In Figure 9, we can see that the decoder side of the autoencoder mirrors the encoder 

side. One key difference is that the decoder's convolutional layers use transpose 



convolution operations to reverse the effects of the convolution operations. The input 

length is doubled when the channels are halved in the decoder. However, this alone is 

insufficient to reach the original data dimensionality, as shown in the code snippet below. 

To address this mismatch, the decoder side uses a fourth kernel attribute called output 

padding, which increases the dimensions of the output by one. Similar to the base model, 

all the layers in the autoencoder go through the ReLU activation function except for the 

decoder's last layer. 

 
Decoder input: torch.Size([40, 3300]) 
Decoder fc: torch.Size([40, 105792]) 
Reshape: torch.Size([40, 64, 1653]) 
Conv3: torch.Size([40, 32, 3307]) 
Conv2: torch.Size([40, 16, 6615]) 
Conv1: torch.Size([40, 1, 13230]) 
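
The decoder lengths follow from the output-length formula for a 1D transposed convolution without dilation, (L - 1) * stride - 2 * padding + kernel + output_padding. In the sketch below, the helper name is hypothetical and the per-layer output padding values are an assumption: they are simply the values that reproduce the printed shapes, since the text does not state which layers use output padding.

```python
def conv_transpose1d_out_len(length, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of a 1D transposed convolution (dilation 1)."""
    return (length - 1) * stride - 2 * padding + kernel + output_padding


length = 1653
# output_padding per layer is inferred from the printed shapes (assumption).
for output_padding in (1, 1, 0):
    length = conv_transpose1d_out_len(length, output_padding=output_padding)
    print(length)  # 3307, then 6615, then 13230
```

Without output padding, the first two layers would produce the even lengths 3306 and 6614, so the final output could not reach 13230.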

 
Figure 9. The decoder part of the Convolutional Autoencoder is in green, and the second 
latent vector is in white. 

 
5.2.3 Testing setup 

As previously established, this test evaluates the difference between linear and convo-

lutional neural network architectures. Both networks’ training starts from zero and is set 



to last until convergence or until a max epoch limit of 500 is reached. In this study, con-

vergence is considered to be achieved when there are 50 epochs with no improvement 

in validation loss. The testing setup consists of multiple variables, which in controlled 

experiment terminology can fall under the category of either unknown variables or intervening 

variables. To limit the number of variables that fall into either of the categories, all other 

variables are kept constant between the test runs. These variables include system set-

tings, dataset composition, and training setup. Chapter 5.1 discusses system settings and 

the dataset in detail, so this section elaborates on the testing setup.  
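
The stopping rule described above (a maximum of 500 epochs, or 50 epochs without improvement in validation loss) can be sketched as follows. This is a simplified illustration, not the training loop used in the study: the function name is hypothetical, and the list of validation losses stands in for real training and validation passes.

```python
def train_until_convergence(validation_losses, max_epochs=500, patience=50):
    """Return (best_epoch, best_loss, stop_epoch) under the stopping rule in the text.

    `validation_losses` is a hypothetical stand-in for real training: an iterable
    that yields one validation loss per epoch.
    """
    best_loss, best_epoch, stop_epoch = float("inf"), -1, -1
    for epoch, loss in enumerate(validation_losses):
        if epoch >= max_epochs:
            break
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # 50 consecutive epochs without improvement: treat as converged.
            break
    return best_epoch, best_loss, stop_epoch


# Synthetic curve (not thesis data): improves until epoch 20, then degrades slowly.
losses = [100.0 - e if e <= 20 else 80.0 + 0.01 * (e - 20) for e in range(500)]
print(train_until_convergence(losses))  # (20, 80.0, 70)
```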

 
Two parts that critically affect the model's ability to learn are the objective function and 

optimization procedure, as explained in Chapter 2. In this study, the Mean Squared Error 

(MSE) function, presented in Equation 3, is employed as the objective function in align-

ment with established literature, given the regression nature of the problem (James et 

al., 2023, p. 28). Building upon the optimization concepts discussed in Chapter 2, this 

study utilizes the Adam optimizer for training the neural network model. Adam, short 

for Adaptive Moment Estimation, is an extension of stochastic gradient descent that 

computes adaptive learning rates for each parameter (Kingma & Ba, 2015). The choice 

of Adam is motivated by its efficiency and effectiveness in handling sparse gradients and 

noisy data, which are common in real-world datasets. The optimizer is configured with a 

learning rate of 1𝑥10$H and a weight decay of 1𝑥10$I. A lower learning rate ensures 

that the model converges smoothly without overshooting the minimum. The weight de-

cay term is a regularization parameter, penalizing large weights to prevent overfitting 

and improve the model's generalization capabilities (Goodfellow et al., 2016, p. 229). 

 
5.3 Results 

Following the principles of a controlled experiment, this study established one independ-

ent variable and seven dependent variables. The independent variable in this study can 

be broadly described as the neural networks’ architecture. In controlled experiments, 

dependent variables are those expected to be affected by changes in independent vari-

ables. They can hence be used to measure results if it is also accepted that a change in 



the dependent variable leads to a qualitative improvement in output data. The depend-

ent variables can be divided into four error calculations, two graphical visualizations and 

one subjective listening review. 

 
The selected error metrics were Mean Squared Error, Mean Absolute Error, Root Mean 

Squared Error, and Signal-to-Noise Ratio (SNR). As mentioned earlier, Mean Squared Er-

ror (MSE) was chosen as the validation loss function, given its effectiveness in regression 

tasks. In addition, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) 

were computed to give further insights into model performance. MAE assesses the av-

erage size of the errors in predictions, providing a clear view of how far predictions de-

viate from actual values in the original scale, making it easier to interpret. RMSE, which 

is derived by taking the square root of MSE, is more responsive to larger errors, giving 

insight into possible outliers in predictions. Signal-to-Noise Ratio was also employed as 

a metric to measure the clarity of the reconstructed audio compared to the original. SNR 

assesses the level of the desired signal relative to the background noise, with higher val-

ues indicating better reconstructions. Compared to other error metrics, SNR provides a 

complementary view of model performance and is particularly suitable for audio pro-

cessing tasks. The validation process utilized two types of graphs: comparison graphs 

illustrating both the original and reconstructed waveforms and magnitude spectrum 

graphs comparing the frequency content of the validation and reconstructed data. Ulti-

mately, the most sensible way to evaluate the music's results is through listening. The 

listening and the analysis were both conducted by the author. 

 
MAE = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|          (9)  

 
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}          (10)  
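
The four error metrics can be computed together, as in the sketch below. The helper name reconstruction_metrics is hypothetical, and the SNR definition used here (signal power divided by error power, expressed in decibels) is an assumption, since the text does not give the exact formula used in the study.

```python
import numpy as np


def reconstruction_metrics(original, reconstructed):
    """Hypothetical helper returning MSE, MAE, RMSE and SNR (in dB)."""
    y = np.asarray(original, dtype=np.float64)
    y_hat = np.asarray(reconstructed, dtype=np.float64)
    error = y - y_hat
    mse = np.mean(error ** 2)
    mae = np.mean(np.abs(error))
    rmse = np.sqrt(mse)
    # SNR definition is an assumption: signal power over error power, in decibels.
    # A perfect reconstruction (zero error) would make this ratio undefined.
    snr_db = 10.0 * np.log10(np.sum(y ** 2) / np.sum(error ** 2))
    return {"mse": mse, "mae": mae, "rmse": rmse, "snr_db": snr_db}


# Toy values for illustration only, not thesis data.
metrics = reconstruction_metrics([1.0, -2.0, 3.0, -4.0], [1.5, -2.0, 2.0, -4.0])
print(metrics)
```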

 
This study includes two training runs, one for each model, allowing them to be trained 

from scratch to assess their performance. An analysis of the results is conducted after 



both training runs are complete. The values of the dependent variable were automati-

cally recorded at the beginning of the training and during training under two specific 

conditions: every 25th epoch and when the best validation loss improved. Figure 10 illustrates 

the validation error progression for both models and gives a good 

general view of their performance. It can be seen that the convolution 

model was worse in the beginning, but within a few epochs, it was able to surpass 

the base model. Also of interest is that the base model reached its minimum at epoch 56, 

after which its validation loss slowly started increasing. In contrast, the convolution model’s validation 

loss continued to decline until epoch 75 and then remained very close to its global minimum. 

It is important to highlight that the graph's resolution is highest during the initial epochs 

and diminishes toward the end of the runs because of the chosen reporting criteria: when 

the validation loss does not improve, it is recorded only every 25th epoch. 

 
Figure 10. Evolution of the validation loss during the test with global minima. The blue 
curve represents the base model’s performance, and the red curve represents 
the convolution model’s performance. 

 

Tables 1 and 2 give a deeper insight into the development of dependent variables. It 

should be noted that epoch numbering starts at zero, and at that point, one training 

round has already been completed. In the tables, the first three rows reflect the progress 

of models after one, five, and ten epochs. The fourth row presents each model's best 

result, while the remaining rows display all recorded outcomes following those best re-

sults. It is worth highlighting some notable findings from the results. Initially, the convo-

lution model had a higher validation loss compared to the base model. However, after 

just five epochs, the convolution model had halved its validation loss, whereas the base 

model only achieved a 30% reduction. Remarkably, it took only ten epochs for the con-

volution model to surpass the base model's lowest validation loss. By the 10th epoch, 

the base model reached results similar to what the convolution model had achieved in 

just half that number of epochs. 

 
Table 1. Summary of changes in the base model's dependent variables. 

Base model MSE MAE RMSE SNR 

Epoch 0 307.26 0.08 16.75 3.18 dB 

Epoch 4 202.67 0.06 13.44 5.61 dB 

Epoch 9 165.03 0.06 12.26 6.57 dB 

Epoch 56 124.20 0.05 11.09 8.42 dB 

Epoch 75 126.90 0.05 11.25 8.37 dB 

Epoch 100 128.29 0.05 11.37 8.42 dB 

Epoch 125 131.28 0.05 11.54 8.35 dB 

 

Table 2. Summary of changes in the convolution model's dependent variables. 

Convolution model MSE MAE RMSE SNR 

Epoch 0 342.58 0.11 17.58 1.64 dB 

Epoch 4 167.56 0.06 12.39 6.32 dB 

Epoch 9 120.48 0.05 10.62 8.55 dB 

Epoch 75 91.05 0.04 9.69 10.23 dB 

Epoch 100 92.57 0.04 9.77 10.16 dB 

Epoch 125 91.91 0.04 9.74 10.27 dB 

 
The following Figures 11 to 20 offer an overview of the training processes for both 

models. These figures are divided into two groups: Figures 11 to 15 showcase amplitude 

waveform data, while Figures 16 to 20 illustrate the magnitude across frequency bins. 

Each figure represents an interesting stage in model progression, providing visual 

benchmarks of how closely each model approximates the target waveform and 

frequency distribution over successive epochs. 

 

 
Figure 11. Waveform of the validation audio.  

 
Figure 12. The base model’s reconstruction data is depicted as a waveform after Epoch 
0. 

 
Figure 13. The base model’s reconstruction data is depicted as a waveform after Epoch 
56. 



 
Figure 14. The convolution model’s reconstruction data is depicted as a waveform after 
Epoch 0. 

 
Figure 15. The convolution model’s reconstruction data is depicted as a waveform after 
Epoch 75. 

 
Figure 11 presents a waveform representation of the validation data, which serves as the 

target reference. Figures 12 and 13 illustrate the initial (Epoch 0) and optimal (Epoch 56) 

results for the base model. As the model progresses from Figure 12 to Figure 13, notable 

changes in the waveform appear, particularly the development of distinct amplitude 

spikes that move closer to the target shape in Figure 11. Figures 14 and 15 similarly track 

the convolution model’s progression, with Figure 14 displaying the initial output (Epoch 

0) and Figure 15 presenting the optimal result (Epoch 75). Figure 14 exhibits more wave-

form volatility compared to Figure 12 but less than Figure 13. Interestingly, the convolu-

tional model's mean squared error (MSE) at Epoch 0 was worse than that of the base 



model, although visually, the waveform in Figure 14 resembles the target validation data 

in Figure 11 more closely than the base model's Figure 12. Figure 15 showcases the con-

volutional model’s best result achieved after Epoch 75, and compared to earlier figures, 

it most closely resembles the target waveform in Figure 11. In comparison to the base 

model's best result shown in Figure 13, the individual amplitude spikes in Figure 15 are 

slightly stronger, making it more similar to Figure 11. However, Figure 15 is not perfect, 

lacking some of the mid-level amplitudes between weak and strong that are present in 

Figure 11. 

 
Figure 16. This represents the frequency distribution and corresponding magnitudes of 
the validation data. 

 
Figure 17. Frequency distribution and magnitude of the base model's reconstruction 
data after Epoch 0. 

 

 
Figure 18. Frequency distribution and magnitude of the base model's reconstruction 
data after Epoch 56. 

 
Figure 19. Frequency distribution and magnitude of the convolution model's reconstruc-
tion data after Epoch 0. 

 
Figure 20. Frequency distribution and magnitude of the convolution model's reconstruc-
tion data after Epoch 75. 


Figure 16 illustrates the frequency distribution and corresponding magnitudes of the val-

idation data, essentially showing how prominent each frequency range is within the tar-

get data. Figure 17, representing the base model's initial reconstruction at Epoch 0, 

shows only a single major frequency spike; unlike the broader distribution seen in Figure 

16, the mass in Figure 17 is concentrated in a narrow range. By Epoch 56, shown in Figure 

18, the base model’s frequency distribution widens and closely resembles Figure 16, but 

it still lacks the smaller spikes at frequencies farther from the primary concentration. 

Figures 19 and 20 show the convolutional model's progression, starting with Epoch 0 in 

Figure 19, where the distribution already replicates much of Figure 16, with a wider base 

and even a secondary spike. However, this figure exhibits a cutoff beyond the 50,000th 

frequency bin, with no data present past that point. Figure 20, showing the convolutional 

model’s best result at Epoch 75, closely resembles Figure 16 but with an even sharper cutoff 

than in Figure 19, beyond which no frequency data appears. 
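
As a hedged illustration, a magnitude spectrum of the kind plotted in Figures 16 to 20 could be obtained with a discrete Fourier transform. The thesis does not state how the spectra were computed, so the function name magnitude_spectrum, the 440 Hz test tone, and the 441-sample length below are assumptions chosen for illustration; only the 4410 Hz sampling rate comes from this study. A direct DFT is used to keep the sketch free of dependencies; in practice an FFT routine would be the usual choice.

```python
import cmath
import math


def magnitude_spectrum(samples, sample_rate_hz):
    """Return (frequencies_hz, magnitudes) for the non-negative frequency bins.

    Hypothetical helper: a direct O(n^2) DFT, suitable only for short signals.
    """
    n = len(samples)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        # DFT coefficient for bin k
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        freqs.append(k * sample_rate_hz / n)  # centre frequency of bin k in Hz
        mags.append(abs(coeff))
    return freqs, mags


SAMPLE_RATE_HZ = 4410  # sampling rate reported for the downsampled audio
N = 441                # assumed length: 0.1 s, giving 10 Hz per bin

# Assumed test signal: a pure 440 Hz sine, not data from the thesis
tone = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE_HZ) for i in range(N)]

freqs, mags = magnitude_spectrum(tone, SAMPLE_RATE_HZ)
peak_hz = freqs[mags.index(max(mags))]
print(peak_hz)  # 440.0
```

Bin k corresponds to a frequency of k times the sampling rate divided by the number of samples, so an absence of magnitude beyond a given bin means the reconstruction carries no energy above the corresponding frequency.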

 
5.4 Analysis 

This chapter focuses on analysing the models' results, considering the reasoning behind 

their performance and the factors that influenced their behaviour. The analysis can be 

divided into two sub-chapters: common features and separating features. Given the 

highly controlled environment and specific test conditions, one might expect that the 

models would share more commonalities than differences. Surprisingly, that is not the 

case. 

 
5.4.1 Common features 

The audio reconstructions of the two neural network models share one notable 

commonality that deserves further examination: sounds typically associated with higher 

frequencies, such as melodies or vocals, are absent from the reconstructed audio. This 

raises the question of why neither model effectively retained these higher-frequency 

elements, and in particular what role downsampling and the limit imposed by the 

Nyquist theorem may have played. 

 
A potential explanation for the lack of melodies could be the effects of downsampling. 

The Nyquist-Shannon sampling theorem states that the highest frequency that can be 

accurately captured by a system is half of the sampling rate (Shannon, 1949). For the 

audio used in this study, which was downsampled to 4410 Hz, the highest frequency that 

can be reconstructed without aliasing is 2205 Hz. However, this explanation alone seems 

insufficient, as most musical instruments produce fundamental frequencies well below 

this limit. 
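
The arithmetic behind this limit can be sketched as follows. The helper names are hypothetical, and the folding calculation assumes a pure tone sampled without an anti-aliasing filter, which the thesis does not state. Resampling tools commonly apply a low-pass filter first, in which case content above the limit is attenuated rather than folded back.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


def aliased_frequency(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling.

    Assumes no anti-aliasing filter: tones above the Nyquist frequency
    fold back into the range [0, sample_rate_hz / 2].
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


SAMPLE_RATE_HZ = 4410  # downsampled rate used in this study

print(nyquist_frequency(SAMPLE_RATE_HZ))        # 2205.0
print(aliased_frequency(440, SAMPLE_RATE_HZ))   # 440, below the limit, unchanged
print(aliased_frequency(4186, SAMPLE_RATE_HZ))  # 224, above the limit, folded back
```

Under these assumptions, a 440 Hz tone is preserved, while a 4186 Hz tone cannot be represented at its true frequency at this sampling rate.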

 
To contextualize the discussion, it is beneficial to consider the frequency ranges of various 

musical instruments. For instance, a piano spans from 27.5 Hz to 4186 Hz (A0 to C8), a 

violin from 196 Hz to approximately 3500 Hz, and a guitar from 82 Hz up to 1 kHz. The 

human voice typically operates at around 120 Hz for males and approximately 200 Hz for 

females, though female singers can reach frequencies as high as 1500 Hz (Pulkki & Kar-

jalainen, 2015, p. 82). These examples il