Mikael Metsälä

Possibilities of convolutions in AI-reconstructed music

Vaasa 2024

School of Technology and Innovations
Master’s thesis in Computing Sciences

UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Mikael Metsälä
Title of the Thesis: Possibilities of convolutions in AI-reconstructed music
Degree: Master of Science in Technology
Programme: Automation and Computer Science
Supervisor: Teemu Mäenpää
Year: 2024
Page count: 71

ABSTRACT:
This thesis investigates the application of convolutional layers within an autoencoder to reconstruct one-dimensional audio data in systems with limited computational resources. The primary objective of this study is to explore whether convolutional layers could improve autoencoder performance by retaining key audio characteristics during the reconstruction process. While deep generative models have shown promise for audio synthesis, research has predominantly focused on large-scale implementations, leaving open questions about the adaptability of these approaches to smaller systems. This study hypothesized that convolutional layers would enable improved reconstructions compared to fully connected (FC) layers within a limited VRAM environment.

To test this hypothesis, a controlled experimental approach was employed, which involved a detailed comparison of the performance of both fully connected and convolutional architectures. Each model was trained from scratch on one-dimensional audio sequences until reaching convergence. This approach allowed for a clear and precise evaluation of the relative effectiveness of each model type. To ensure a comprehensive assessment, several key metrics were selected, including mean squared error as one of the primary metrics, alongside observations of convergence rate and memory efficiency to evaluate model performance.

The findings indicate that the convolutional autoencoder achieved superior reconstruction quality, as evidenced by its lower mean squared error and faster epoch-wise progression to accuracy, despite taking slightly longer per epoch than the FC model. These results highlight convolutional architectures' potential to facilitate high-quality audio reconstruction on smaller systems, making advanced AI-driven audio analysis more accessible. The convolutional model’s ability to represent low-frequency components more effectively and with less added noise than the FC model supports the hypothesis, although challenges, such as limitations in replicating high-frequency components, were noted in both models. Overall, these results suggest that convolutional autoencoders could offer a promising approach for efficiently reconstructing audio data on constrained hardware.

The study contributes valuable insights to music analysis and AI audio research, particularly in the context of scalable model design for low-resource environments. It acknowledges limitations, such as subjective sound quality assessment and hardware constraints, and recommends future work. Further research might focus on enhancing frequency representation within convolutional networks and improving audio separation capabilities. By advancing methods that operate effectively on smaller systems, this study encourages further exploration of accessible AI applications in music analysis and digital audio processing.
KEYWORDS: audio processing, artificial intelligence, machine learning, neural networks

VAASAN YLIOPISTO
School of Technology and Innovations
Author: Mikael Metsälä
Title of the Thesis: Possibilities of convolutions in AI-reconstructed music
Degree: Master of Science in Technology
Programme: Automation and Computer Science
Supervisor: Teemu Mäenpää
Year: 2024
Page count: 71

TIIVISTELMÄ (abstract):
This thesis examines the application of an autoencoder that uses convolutional layers and runs on limited computational resources to the reconstruction of audio signals. The aim is to determine whether convolutional layers can improve the autoencoder's ability to learn and help it retain features characteristic of music during the reconstruction process. Earlier studies have demonstrated the capability of deep generative models in audio synthesis when vast amounts of computing power and memory were available, which leaves the question open for systems with less computing power. The hypothesis of this study is that convolutional layers can provide better reconstruction than fully connected layers in systems with a limited amount of memory.

To test the hypothesis, a comparative experiment was carried out in which the performance of a neural network composed of fully connected layers was compared with that of a convolution-based network. Both models were trained on audio data until they converged. This yielded a clear and precise comparison of the effectiveness of the architectures. Model performance was evaluated primarily using mean squared error; convergence rate and memory usage were also examined.

The results show that the autoencoder with convolutional layers reconstructs the audio signal better, as evidenced by its lower mean squared error and the smaller amount of noise it produces. Based on these results, the convolutional architecture shows potential for high-quality audio signal reconstruction on systems with limited computing power, which improves the accessibility of such AI-based systems. Both models showed difficulties in reconstructing high frequencies.

In conclusion, convolutional layers improve the autoencoder's ability to reconstruct an audio signal, particularly at low frequencies and by reducing noise, which makes it possible to use the model on systems with limited computing power as well. This demonstrates the potential of convolution-based architectures for high-quality reconstruction of audio data and makes the application of AI in music analysis and audio processing available to a wider audience. Future research could focus on the ability of convolutional models to separate different frequency components more precisely and on improving their performance in handling high frequencies.
KEYWORDS: audio processing, artificial intelligence, machine learning, neural networks

Contents

1 Introduction
2 Basic concepts and technologies
  2.1 Categorisation
    2.1.1 Parameter-based
    2.1.2 Non-parameter-based
  2.2 Deep Neural Networks
    2.2.1 Model
    2.2.2 Dataset
    2.2.3 Objective Function
    2.2.4 Optimisation Procedure
  2.3 Neural Network Architectures
    2.3.1 Autoencoders
    2.3.2 Variational Autoencoders
    2.3.3 Generative Adversarial Networks
    2.3.4 Transformers
3 Related Work
  3.1 WaveNet
    3.1.1 Causal and Dilated Convolutions
    3.1.2 Non-linear Quantisation
  3.2 Jukebox
    3.2.1 Multi-scale VQ-VAE
    3.2.2 Scalable Transformers and Upsampling
  3.3 Summary
4 Methodology
5 Experiment Design and Model Evolution
  5.1 Dataset and System
  5.2 Model Evolution
    5.2.1 Preprocessing
    5.2.2 Model development
    5.2.3 Testing setup
  5.3 Results
  5.4 Analysis
    5.4.1 Common features
    5.4.2 Separating features
6 Conclusion
References

Pictures

Picture 1. Screenshot of Task Manager during convolutional model training.
Picture 2. Screenshot of Task Manager during base model training.

Figures

Figure 1. A simple feedforward neural network by Goodfellow et al. (2016, p. 170)
Figure 3. Variational Autoencoder depicted as a figure. Blue boxes represent the encoder, white is the latent space, and green boxes represent the decoder. 𝒙 is input, and 𝒙 represents output.
Figure 2. Generative Adversarial Network architecture depicted by Bengesi et al. (2023).
Figure 4. Scaled Dot-Product Attention architecture (Vaswani et al. 2017).
Figure 5. Transformer architecture (Vaswani et al. 2017). The figure consists of an encoder on the left and a decoder on the right.
Figure 6. Different types of convolutions. Blue dots depict the input layer, grey marks the hidden layers, and yellow is the output layer.
Figure 7. Effects of μ-law companding on a waveform.
Figure 8. The encoder part of the Convolutional Autoencoder is in blue, and the first latent vector is in white.
Figure 9. The decoder part of the Convolutional Autoencoder is in green, and the second latent vector is in white.
Figure 10. Evolution of the validation loss during the test with global minima. The blue curve represents the base model’s performance, and the red curve represents the convolution model’s performance.
Figure 11. Waveform of the validation audio.
Figure 12. The base model’s reconstruction data is depicted as a waveform after Epoch 0.
Figure 13. The base model’s reconstruction data is depicted as a waveform after Epoch 56.
Figure 14. The convolution model’s reconstruction data is depicted as a waveform after Epoch 0.
Figure 15. The convolution model’s reconstruction data is depicted as a waveform after Epoch 75.
Figure 16. This represents the frequency distribution and corresponding magnitudes of the validation data.
Figure 17. Frequency distribution and magnitude of the base model's reconstruction data after Epoch 0.
Figure 18. Frequency distribution and magnitude of the base model's reconstruction data after Epoch 56.
Figure 19. Frequency distribution and magnitude of the convolution model's reconstruction data after Epoch 0.
Figure 20. Frequency distribution and magnitude of the convolution model's reconstruction data after Epoch 75.

Tables

Table 1. Summary of changes in the base model's dependent variables.
Table 2. Summary of changes in the convolution model's dependent variables.

Abbreviations

ADAM     Adaptive Moment Estimation
AI       Artificial Intelligence
BERT     Bidirectional Encoder Representations from Transformers
CNN      Convolutional Neural Network
CV       Computer Vision
EDM      Electronic Dance Music
FC       Fully Connected
GAN      Generative Adversarial Network
GPT      Generative Pre-trained Transformer
GPU      Graphics Processing Unit
KL       Kullback-Leibler
MAE      Mean Absolute Error
MLP      Multilayer Perceptron
MSE      Mean Squared Error
NLP      Natural Language Processing
RAM      Random Access Memory
ReLU     Rectified Linear Unit
RFFT     Real Fast Fourier Transform
RMSE     Root-Mean-Square Error
RNN      Recurrent Neural Network
SGD      Stochastic Gradient Descent
SNR      Signal-to-Noise Ratio
STFT     Short-Time Fourier Transform
VAE      Variational Autoencoder
ViT      Vision Transformer
VQ-VAE   Vector Quantized Variational Autoencoder
VRAM     Video Random Access Memory

1 Introduction

Music generated with the help of Artificial Intelligence is a topic that has puzzled researchers for decades. AI-generated music research took its first steps as early as the 1950s, when Hiller Jr. and Isaacson (1957) released a model that generated sheet music based on the Markov chain model. Many of these early parameter-based generative models were not multi-layered neural networks (Zhu et al., 2023), which at the time suffered from a lack of efficient training methods (Briot et al., 2019, p. 41). In the 2006 paper A Fast Learning Algorithm for Deep Belief Nets, Hinton et al. introduced a solution to this, paving the way for the rise of deep neural networks. In 2012, AlexNet, a deep neural network, won the ImageNet image recognition competition, resulting in a paradigm shift that made deep learning the state-of-the-art solution for prediction problems (Briot et al., 2019, pp. 3, 41–42).

Deep learning is a vague term, as it has no scientifically agreed-upon definition (Briot et al., 2019, p. 3).
As a part of artificial intelligence, it usually refers to machine learning done with deep neural networks consisting of multiple layers that hierarchically extract and abstract data (Briot et al., 2019, p. 3). Briot et al. (2019, p. 3) highlight three major milestones that have fuelled the surge of deep learning: an increase in the quantity of data available, enhanced availability of computational resources, and technological advances, notably the meaningful application of convolutions, which are particularly relevant to the context of this thesis. This is supported by Bengesi et al. (2023), who identified that prior to 2010, interest in deep learning was hindered by the limited availability of computing resources and a lack of sufficiently large datasets.

After the major roadblocks were overcome and deep learning research gained wind in its sails, deep learning took a new course in 2013 when Kingma and Welling released the Variational Autoencoder (VAE), followed by Goodfellow et al. (2014) with their Generative Adversarial Network (GAN), building the foundation for Generative Artificial Intelligence. At the time of their introduction, these new types of neural networks aimed to capture the underlying probability distribution of the data (Goodfellow et al., 2016, pp. 693, 697). The benefit of learning the distribution is the possibility of sampling it and generating novel data instances that resemble the original data (Goodfellow et al., 2016, p. 707).

These networks mainly operated on relatively low-resolution images, where the input for the network was a complete image, such as those in the MNIST database of handwritten digits (Goodfellow et al., 2014). One image from MNIST has a 28x28 resolution (LeCun et al., 1998). However, a 2-minute song sampled at the usual 44.1 kHz has roughly 5 million input parameters, compared to an image of the MNIST set, which has a little less than 800 input parameters. As Dhariwal et al. (2020) note, processing inputs of this size is very computationally demanding.
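The size comparison above can be checked with a few lines of arithmetic. The sketch below assumes a single (mono) audio channel, which the text does not state; the sample rate, song length, and image resolution are the ones given above, and the function name is only illustrative.

```python
# Illustrative arithmetic only; the mono-channel assumption is ours, not the thesis's.
SAMPLE_RATE_HZ = 44_100   # "usual" sample rate mentioned in the text
SONG_SECONDS = 2 * 60     # 2-minute song
MNIST_SIDE = 28           # MNIST images are 28x28 pixels


def audio_input_size(seconds: int, sample_rate_hz: int, channels: int = 1) -> int:
    """Number of raw samples the network would receive for one clip."""
    return seconds * sample_rate_hz * channels


song_inputs = audio_input_size(SONG_SECONDS, SAMPLE_RATE_HZ)
mnist_inputs = MNIST_SIDE * MNIST_SIDE

print(song_inputs)                   # 5292000
print(mnist_inputs)                  # 784
print(song_inputs // mnist_inputs)   # 6750
```

Under these assumptions, one song carries several thousand times more input values than one MNIST image, and a stereo recording would double that figure.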
This has led to new solutions in which the input is split into segments that are fed to the network as separate instances.

This “segmented learning” can still work for non-generative tasks. However, randomly sampling the latent space for multiple audio segments and combining them is unlikely to create a coherent song. This problem has given birth to autoregressive networks that model the probability of each new sample conditioned on all previous samples (van den Oord et al., 2016). Such networks are designed to generate long and coherent audio and usually consist of two separate neural networks, which are autoencoding and autoregressive in nature (Dhariwal et al., 2020). The first layers of the autoencoder are often convolutional layers designed to maximise the receptive field of the network, making it easier to model longer temporal dependencies (van den Oord et al., 2016).

The main aim of this thesis is to explore the use of convolutions for one-dimensional sequential audio, focusing on the regenerative capabilities of autoencoders in music. The inspiration for this study arises from the challenges identified by Dieleman et al. (2018), particularly the lack of long-term structure in AI-generated music. Previous research has demonstrated the feasibility of capturing local structures like timbre, but modelling higher-level structures, such as verses and choruses, remains elusive (Dieleman et al., 2018). However, generative AI systems, especially in music, typically require extensive computational resources to achieve coherent and high-quality output (Dhariwal et al., 2020). This study aims to explore approaches that can be implemented on smaller systems, offering feasible solutions for researchers without access to vast computational resources. This study seeks to answer the question: Can a convolutional architecture enhance the regenerative performance of an autoencoder on one-dimensional audio data?

This study hypothesizes that applying convolutions will improve the autoencoder's ability to capture and preserve essential structural features in the reconstructed output, thereby enhancing its understanding of relationships between preceding and succeeding data points.

This study follows a controlled experimental design to explore the effects of different neural network architectures on AI-regenerated music. A controlled experiment allows for precise manipulation of independent variables, such as the type of neural network used, and careful observation of their effects on the dependent variables, including the quality and characteristics of the music. By creating a structured environment, this study aims to isolate specific factors and assess their impact on model performance. The methodology employed in this thesis provides a systematic approach to validate the hypothesis and contribute to the broader understanding of neural network architectures in regenerative music through empirical evidence.

This study is divided into six chapters. The second chapter consists of an overview of neural networks, aiming to give readers an understanding of the basic concepts and technologies surrounding the field. The third chapter examines two influential studies that inspired this research. The fourth chapter discusses the methodology behind this study. Chapter five depicts the model and describes the experiment. The last chapter concludes this process and discusses possible future directions for this line of study.

2 Basic concepts and technologies

Artificial intelligence is a vast field that is constantly expanding. One of its subsets is Generative Artificial Intelligence, which has recently become more popular in the eyes of the public (Bengesi et al., 2023). It can be hard to get a grasp of the field as it is moving so fast, but using categories can help make it easier to understand.
There are various ways to categorise the process of generating music using AI models (Zhu et al., 2023; Bengesi et al., 2023). In addition to categorisation, this chapter explains the building blocks of a generic neural network and provides an overview of popular Generative Artificial Intelligence architectures, their functionalities, and how they operate.

2.1 Categorisation

In their 2023 survey, Zhu et al. introduced an approach that divides models into two categories: parameter-based and non-parameter-based. A characteristic of this approach is that the models are differentiated by the type of input they use (Zhu et al., 2023). The non-parameter-based category is further divided into two subcategories: prompt-based and visual-based models.

Generative models can also be divided by model architecture, as shown by Bengesi et al. (2023). Architectures that have gained popularity are Generative Adversarial Networks, Variational Autoencoders, and Transformers (Bengesi et al., 2023). These architectures are described in depth later in this chapter.

2.1.1 Parameter-based

Parameter-based models represent the majority of the models listed by Zhu et al. (2023). These models range from Hiller Jr.’s and Isaacson’s (1957) Markov chain models to deep neural networks such as Dhariwal et al.'s (2020) multi-scale Vector Quantized Variational Autoencoder. Common to these models is that they require specific input parameters such as tempo or key (Zhu et al., 2023).

2.1.2 Non-parameter-based

A good example of prompt-based models is MusicLM, developed by Agostinelli et al. (2023). The model takes a text prompt as an input and uses sequence-to-sequence modelling to generate multiple-minute-long songs that adhere to the text prompt (Agostinelli et al., 2023). Applications of visual-based models like V-MusProd by Zhuo et al. (2022) include generating background music for videos by conditioning the model with images or video.

2.2 Deep Neural Networks

As Bengio et al. (2012) neatly put it, AI's goal is to “understand the world around us”. The pursuit of this goal has led researchers to turn to deep learning, which, as previously established, involves the use of deep neural networks for machine learning (Briot et al., 2019, p. 3). Goodfellow et al. (2016, p. 151) describe the fundamental independent elements necessary for constructing such a machine learning algorithm as a model, a dataset, an objective function, and an optimisation procedure. Next, this study expands on these four basic elements and their implications.

2.2.1 Model

In the context of machine learning, the model, often referred to as a neural network, is an artificial construction that mimics the neurons of the human brain (Nwadiugwu, 2021). Each neuron in a neural network has attributes called a weight and a bias; these two, together with the input and an activation function, for example, the sigmoid activation, determine the strength of the signal that is passed to the neuron in the next layer (Goodfellow et al., 2016, pp. 107, 65-66). This relationship is defined in Equation 1, and in Figure 1, each arrow represents a weight. In the model, the flow of the information or signal happens in two passes, a forward pass and a backward pass, also referred to as forward propagation and back-propagation (Goodfellow et al., 2016, p. 200).

$$\hat{y} = \sigma(wx + b) \tag{1}$$

A feedforward network, or Multilayer Perceptron (MLP), is a basic neural network architecture composed of layers of neurons where, in the forward pass, the signals flow unidirectionally from the input layer to the output layer (Goodfellow et al., 2016, p. 164). Figure 1 illustrates a typical feedforward network, which is designed to approximate a function by mapping inputs to outputs and has a clearly defined structure with multiple layers: an input layer 𝑥, one hidden layer ℎ, and an output layer 𝑦 (Goodfellow et al., 2016, pp. 164-165).
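As a minimal sketch of Equation 1 and the forward pass described above, the following code evaluates a single sigmoid neuron and then chains two fully connected layers (input, one hidden layer, output). All weights, biases, and function names are hypothetical values chosen for illustration; they do not come from the thesis.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x: float, w: float, b: float) -> float:
    """Equation 1 for a single scalar input: y_hat = sigma(w * x + b)."""
    return sigmoid(w * x + b)


def dense_layer(inputs, weights, biases):
    """One fully connected layer: every output neuron sums its weighted
    inputs, adds its bias, and applies the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass: input layer x -> hidden layer h -> output layer y."""
    h = dense_layer(x, hidden_w, hidden_b)
    return dense_layer(h, out_w, out_b)


# Hypothetical parameters: 2 inputs, 3 hidden neurons, 1 output neuron.
x = [0.5, -1.0]
hidden_w = [[0.1, 0.4], [-0.3, 0.2], [0.5, 0.5]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[0.2, -0.4, 0.6]]
out_b = [0.05]

y_hat = forward(x, hidden_w, hidden_b, out_w, out_b)
print(neuron(0.0, 1.0, 0.0))  # 0.5, since sigma(0) = 0.5
print(y_hat)                  # one value strictly between 0 and 1
```

Each row of a weight matrix corresponds to the arrows entering one neuron in a diagram such as Figure 1.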
While feedforward networks are foundational, other architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) cater to specific data types and tasks, leveraging unique structural features to process spatial and sequential data, respectively (Goodfellow et al., 2016, pp. 326, 367).

Figure 1. A simple feedforward neural network by Goodfellow et al. (2016, p. 170)

CNNs are a specialised type of feedforward network where one or more layers are replaced with convolutional layers (Goodfellow et al., 2016, p. 326). The convolutional operation in these layers involves sliding a filter or kernel 𝑤 over the input data 𝑥 to produce a feature map 𝑠 (Goodfellow et al., 2016, p. 328). This process captures local patterns by applying the same filter across various parts of the input, thus enabling CNNs to process spatial or multidimensional data like images efficiently (Goodfellow et al., 2016, pp. 330-333). To better understand the convolutional operation, consider a simplified one-dimensional example: a signal 𝑥 is processed by a shorter filter 𝑤, which is a set of weights. During convolution, the filter 𝑤 is slid along the signal 𝑥, and at each position the filter is multiplied element-wise with the segment of the signal it covers and the products are summed.

$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\,w(t - a) \tag{2}$$

Mathematically, the convolution operation is depicted by Goodfellow et al. (2016, p. 327) as Equation 2, where the asterisk (*) denotes convolution. In this equation, 𝑡 represents the index of the element in the feature map that is being calculated. Importantly, the filter 𝑤 is shorter than the signal 𝑥, which leads to an interesting property in the equation: the infinite sum can be calculated over a finite number of terms (Goodfellow et al., 2016, p. 328). This is because 𝑤 is zero outside the bounds of the filter, so any product involving 𝑤 outside its bounds is also zero.
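Equation 2 can be sketched for finite sequences as follows. The code treats both the signal and the filter as zero outside their stored values, which is the property that turns the infinite sum into a finite one. The function name and the example values are illustrative assumptions, not taken from the thesis.

```python
def convolve(x, w):
    """Discrete convolution s(t) = sum_a x(a) * w(t - a) for finite lists.

    Values outside the stored range of x or w are treated as zero, so only
    index pairs with 0 <= t - a < len(w) contribute to the sum.
    """
    out_len = len(x) + len(w) - 1
    s = [0.0] * out_len
    for t in range(out_len):
        for a in range(len(x)):
            k = t - a                 # index into the filter
            if 0 <= k < len(w):       # w is zero outside its bounds
                s[t] += x[a] * w[k]
    return s


signal = [1.0, 2.0, 3.0, 4.0]   # hypothetical input signal x
kernel = [1.0, 0.0, -1.0]       # hypothetical filter w, shorter than x

print(convolve(signal, kernel))  # [1.0, 2.0, 2.0, 2.0, -3.0, -4.0]
```

This version returns every position where the filter overlaps the signal at all; practical layers often keep only the positions where the filter fits entirely inside the signal, or pad the signal instead.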
Another notable property involves 𝑡 − 𝑎, which suggests that the filter is reversed, a process often called flipping (Goodfellow et al., 2016, p. 328). However, Goodfellow et al. (2016, p. 329) note that in machine learning implementations, this flipping is often not performed, and the operation should more accurately be called cross-correlation, though it is still commonly referred to as convolution.

2.2.2 Dataset

The dataset consists of the representation or features of the data and plays a crucial role in the model's performance (Bengio et al., 2012). This connection between representation and performance has inspired researchers to develop algorithms capable of representation learning, which has reduced the need for feature engineering (Bengio et al., 2012). Feature engineering refers to a process where the dataset is manually configured into a form that is more suitable for the model (Bengio et al., 2012). The reduced need for manual labour has sped up the process of utilising artificial intelligence (Bengio et al., 2012). In essence, representation learning refers to a concept in which the neural network is presented with raw data and, during training, automatically learns the meaningful features of the data (Bengio et al., 2012).

There are multiple ways to make a neural network learn, as described in the book Deep Learning by Goodfellow et al. (2016, pp. 103-104). Two of the most prominent ones are called supervised and unsupervised learning. Although there is no formal definition, supervised learning typically involves solving regression and classification problems, while unsupervised learning aims to understand the underlying probability distribution of the data (Goodfellow et al., 2016, p. 103). One way to understand this is that in supervised learning, the model is given a label 𝑦 in addition to the data 𝑥, and it tries to do classification by learning the probability 𝑝(𝑦|𝑥) (Goodfellow et al., 2016, p. 103). In the unsupervised learning process, the model tries to learn the underlying probability distribution 𝑝(𝑥) automatically (Goodfellow et al., 2016, p. 142). The model can then be used in, for example, anomaly detection, where deviations from the expected distribution can signal atypical events (Goodfellow et al., 2016, p. 100).

A third learning method derived from the above-mentioned approaches is self-supervised learning, which is usually associated with more complex deep neural networks (Ericsson et al., 2022). Ericsson et al. (2022) have divided self-supervised learning into a pretext task and a downstream task. In the pretext task, the model uses unsupervised learning to capture a meaningful data representation, for example, in a lower dimension (Ericsson et al., 2022). The downstream task then utilises this new domain for improved learning (Ericsson et al., 2022). The related works chapter discusses OpenAI’s music generation model, Jukebox, which can be thought to represent this learning method: Vector Quantized Variational Autoencoder (VQ-VAE) training is the pretext task, and Scalable Transformer training matches the downstream task description.

2.2.3 Objective Function

An objective function is a mathematical function that guides a machine learning model to adjust its weight and bias parameters in an attempt to minimise or maximise the objective function (Goodfellow et al., 2016, p. 80). In supervised learning, the goal is often to minimise the difference between the model's outputs and the target values, and this is measured with a cost function, also known as a loss function (Goodfellow et al., 2016, p. 80; Nielsen, 2015, p. 16).
In regression tasks, also known as quantitative tasks, the Mean Squared Error (MSE) is widely used to quantify the average of the squares of the errors, effectively measuring the discrepancy between estimated and actual values (James et al., 2023, p. 28). In contrast, for classification tasks or qualitative tasks, Cross-Entropy Loss is frequently employed, as it quantifies the divergence between the actual labels and the predicted labels and results in faster convergence compared to Mean Squared Error (James et al., 2023, p. 28; Nielsen, 2015, p. 63). Equation 3 depicts MSE, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{3}$$

As the cost function is used to direct learning, it can be thought that decreasing cost is a sign of learning (James et al., 2023, p. 28). However, this is not always the case, as machine learning models often suffer from overfitting, a phenomenon where the model learns the training data very well but fails to generalise effectively to new, unseen test data (Nielsen, 2015, p. 75; Goodfellow et al., 2016, p. 109). More specifically, it can occur when the model parameter count is high and the amount of training data is low (Nielsen, 2015, p. 74). Luckily, overfitting is not necessarily a sign that the model is inherently unable to learn, as prolonged training can be its cause (Nielsen, 2015, p. 75).

To prevent overfitting, regularisation techniques such as L2 regularisation are introduced into the cost function (Nielsen, 2015, p. 79). Regularisation tries to ensure the model does not overly adapt to the noise within the training data (Nielsen, 2015, p. 84). In L2 regularisation, a regularisation term is added to the cost function. In a machine learning setting, this term is often the squared L2 norm depicted in Equation 4, where 𝜆 is the regularisation parameter, which balances how well the model fits the data against how small the weights are kept (Nielsen, 2015, p. 79; Goodfellow et al., 2016, pp. 117, 227).

$$\lambda\lVert w\rVert_2^2 = \lambda\sum_{j} w_j^2 \tag{4}$$

Nielsen (2015, p. 86) states that there is no entirely convincing theoretical explanation for why regularisation works. Regularisation simplifies the network, and simplicity is often offered as a general scientific principle for why it works, but Nielsen (2015, p. 85) points out that simpler does not always equal better. Goodfellow et al. (2016, pp. 117-118) talk about the importance of domain knowledge when designing machine learning solutions and how excessive regularisation can hinder the model’s ability to learn and lead to underfitting. Underfitting is the opposite of overfitting, and it occurs when the model is unable to learn the training data (Goodfellow et al., 2016, p. 109).

The use of regularisation boils down to a bias-variance trade-off described by James et al. (2023, pp. 242-243), where increased regularisation decreases variance but increases bias. Based on the above, it can be derived that the goal of regularisation is to restrict the model’s capacity to overfit while still having a low enough bias that the model does not underfit. Finding this balance is crucial when trying to achieve the best possible generalisation.

More complex neural networks have different types of regularisation methods, one of which is Kullback-Leibler (KL) Divergence, which is used as a regularisation term in VAEs (Goodfellow et al., 2016, p. 693). It penalises deviations from expected probability distributions, ensuring desirable properties such as continuity for the posterior distribution (Goodfellow et al., 2016, p. 72). These cost functions are foundational to the optimisation procedure, which is discussed in the subsequent section.

2.2.4 Optimisation Procedure

Optimisation is generally a very difficult task, and the optimisation procedure is the fourth element of a machine learning algorithm described by Goodfellow et al. (2016, pp. 151, 279). It refers to the process of minimising or maximising the objective function 𝑓(𝑥) by optimising 𝑥 (Goodfellow et al., 2016, p. 80). One of the most prominent optimisation procedures is gradient-based optimisation (Goodfellow et al., 2016, p. 80). Previously, gradient-based optimisation was described as “slow or unreliable”, but it has since been accepted that it provides useful results in a reasonable time, even though it does not always give the optimal solution (Goodfellow et al., 2016, p. 150). In other words, gradient descent converges to a local minimum or close to it but seldom finds the global minimum. Gradient descent is a process that utilises partial and directional derivatives to calculate the gradient $\nabla_x f(x)$, and the objective is to determine the direction that decreases 𝑓(𝑥) the most rapidly (Goodfellow et al., 2016, pp. 82-83). The gradient points in the direction of the steepest ascent (Goodfellow et al., 2016, p. 83). Equation 5 describes the optimisation of 𝑥 by nudging it in the direction of the negative gradient (downhill). The coefficient 𝜖 is called the learning rate, and it determines the step length for the optimisation process (Goodfellow et al., 2016, p. 84). Usually, the learning rate is a small constant (Goodfellow et al., 2016, p. 84).

$$x' = x - \epsilon\,\nabla_x f(x) \tag{5}$$

Nowadays, the machine learning field is dominated by the stochastic gradient descent (SGD) algorithm (Goodfellow et al., 2016, p. 150). SGD is an extension of basic gradient descent, and the need for it becomes obvious when larger and larger training sets are introduced to improve generalisation (Goodfellow et al., 2016, p. 149). The problem with regular gradient descent is that when the amount of data 𝑥 grows, so does the computational cost (Goodfellow et al., 2016, p. 149). This is represented with 𝑂(𝑥), which means that the cost is linear (Goodfellow et al., 2016, p. 149).
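The update rule of Equation 5 can be illustrated by minimising the MSE of Equation 3, optionally with the L2 penalty of Equation 4, for a one-parameter model. The model y_hat = w * x, the data, the learning rate, and the number of steps below are hypothetical choices made for this sketch; the gradient is computed over every sample at each step, which is the linear cost mentioned above.

```python
def cost(w, xs, ys, lam=0.0):
    """MSE of the model y_hat = w * x plus an optional L2 penalty lam * w^2."""
    n = len(xs)
    mse = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n
    return mse + lam * w * w


def gradient(w, xs, ys, lam=0.0):
    """Derivative of cost() with respect to w, using all samples."""
    n = len(xs)
    d_mse = sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys)) / n
    return d_mse + 2.0 * lam * w


def gradient_descent(w, xs, ys, lr=0.05, steps=200, lam=0.0):
    """Repeatedly apply w' = w - lr * gradient (Equation 5 with x := w)."""
    for _ in range(steps):
        w = w - lr * gradient(w, xs, ys, lam)
    return w


# Hypothetical data generated by y = 2x, so the unregularised optimum is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_fit = gradient_descent(0.0, xs, ys)
w_reg = gradient_descent(0.0, xs, ys, lam=0.1)

print(w_fit)  # very close to 2.0
print(w_reg)  # slightly smaller, because the penalty shrinks the weight
```

With the penalty switched on, the fitted weight ends up below 2: the model gives up a little accuracy on the training data in exchange for a smaller weight, which is the balance that 𝜆 controls.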
SGD solves this problem by drawing a small set of samples uniformly from the data, called a minibatch, and doing the gradient calculation on that limited number of samples (Goodfellow et al., 2016, p. 149). This leads to the computational cost per update becoming independent of the amount of data, and it is denoted with 𝑂(1), meaning that the computational cost is constant (Goodfellow et al., 2016, p. 149).

2.3 Neural Network Architectures

Neural network architectures encompass a wide range of models that are based on the idea of deep neural networks, yet models based on these architectures differ in their objectives and applications. While some neural network architectures, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), are designed to generate new data based on learned probability distributions, other architectures, such as standard Autoencoders, focus on data reconstruction (Bengesi et al., 2023; Goodfellow et al., 2016, p. 499). This chapter will discuss key neural network architectures, including Autoencoders, Variational Autoencoders, Generative Adversarial Networks, and Transformers, highlighting their purposes and underlying mechanisms.

2.3.1 Autoencoders

Autoencoders are a neural network architecture focused on unsupervised learning tasks, including dimensionality reduction and feature extraction (Goodfellow et al., 2016, p. 499). The foundation for autoencoders was laid by Rumelhart, Hinton, and Williams in 1986 with the introduction of backpropagation, enabling neural networks to learn internal representations (Rumelhart et al., 1986). The concept of autoencoders as a specific neural network structure was later advanced by researchers such as LeCun (1987), Bourlard and Kamp (1988), and Hinton and Zemel (1994) (Goodfellow et al., 2016, p. 499).

Typically, an autoencoder comprises three main components: the encoder, a bottleneck or latent vector, and the decoder (Goodfellow et al., 2016, pp. 499-500).
This structure is similar to the one illustrated in Figure 2. As the data flows through the network (from left to right in Figure 2), the encoder reduces the size of the input data to fit through the bottleneck of the latent space, and the decoder tries to reconstruct the original input from this representation (Goodfellow et al., 2016, p. 499).

Training an autoencoder involves using backpropagation and gradient descent to minimise the reconstruction error between the input and the output (Goodfellow et al., 2016, p. 499). This process allows the neural network to capture essential features of the data while discarding irrelevant information (Rumelhart et al., 1986; Goodfellow et al., 2016, p. 499). Goodfellow et al. (2016, p. 499) highlight that learning results can be improved by feeding the network incomplete input data and calculating the error between the reconstruction and the complete input data.

Autoencoders are widely used for various purposes, including noise reduction, where they learn to reconstruct clean data from noisy inputs (Vincent et al., 2008; Goodfellow et al., 2016, p. 499). They are also applied in anomaly detection, where deviations between the input data and its reconstruction indicate unusual patterns, making them valuable for identifying outliers in data (Chalapathy & Chawla, 2019). Moreover, autoencoders serve in dimensionality reduction, compressing high-dimensional data into a more manageable form and aiding in tasks such as data visualisation and feature extraction (Hinton & Salakhutdinov, 2006). These applications highlight the versatility of autoencoders across various unsupervised learning tasks.

2.3.2 Variational Autoencoders

The concept of Variational Autoencoding was introduced in the paper Auto-Encoding Variational Bayes by Kingma and Welling in 2014. It was developed as a general solution for problems with intractable posterior distributions 𝑝(𝑧|𝑥) in which parameters or latent variables are continuous.
The proposed solution utilises the stochastic gradient variational Bayes estimator to optimise the approximate posterior distribution 𝑞(𝑧|𝑥).

As described in van den Oord et al.'s 2017 paper Neural Discrete Representation Learning, a Variational Autoencoder consists of practically three parts: an encoder, a latent space, and a decoder (illustrated by Figure 2), where the latent space is often continuous (van den Oord et al., 2017). These components correspond to three probability distributions: a posterior distribution 𝑞(𝑧|𝑥), a prior distribution 𝑝(𝑧), and a likelihood 𝑝(𝑥|𝑧) (van den Oord et al., 2017). Kingma and Welling (2013) refer to 𝑞(𝑧|𝑥) as a probabilistic encoder and 𝑝(𝑥|𝑧) as a probabilistic decoder. Initially, a prior distribution 𝑝(𝑧) is defined, representing the latent space’s expected shape or form, and it directs the posterior distribution in a specific direction; usually, the prior is a standard Gaussian (van den Oord et al., 2017).

Figure 2. Variational Autoencoder depicted as a figure. Blue boxes represent the encoder, white is the latent space, and green boxes represent the decoder. 𝒙 is the input, and $\hat{x}$ represents the output.

During the Variational Autoencoder's training, the model learns to refine the mappings of the encoder and decoder by optimising the parameters that define the conditional probability distributions 𝑞(𝑧|𝑥), the probability of 𝑧 given 𝑥, and 𝑝(𝑥|𝑧), the likelihood of observing 𝑥 given 𝑧 (van den Oord et al., 2017). To enable effective training through stochastic gradient descent, the latent variable 𝑧 is expressed as a deterministic function $g_\phi(\epsilon, x)$, parameterised by 𝜙, of the input 𝑥 and an independent noise variable 𝜖 sampled from a standard distribution (Kingma & Welling, 2013). This method, called the reparameterization trick by Kingma and Welling (2013), allows for gradient flow during backpropagation.
This can also be visualised with the following code snippet of the reparameterization function, where mu and log_var are the encoder outputs determined by the parameters 𝜙 and eps is the noise variable 𝜖.

    import numpy as np

    def reparametrize(self, mu, log_var):
        # Convert the log-variance into a standard deviation.
        std = np.exp(0.5 * log_var)
        # Draw the noise from a standard normal distribution.
        eps = np.random.standard_normal(std.shape)
        return mu + eps * std

VAEs have been applied in numerous fields, such as image processing, medical applications, and language modelling (Wei et al., 2020). In the image processing category, VAEs are considered to be state-of-the-art in image classification, image compression, and image resolution (Bengesi et al., 2023). More recent advancements have been made in the 3D imaging domain, as VAEs enable the efficient compression of high-dimensional spaces (Molnár & Tamás, 2024).

In 2017, van den Oord et al. introduced a variant of VAEs called a Vector Quantized Variational Autoencoder (VQ-VAE). Unlike traditional Variational Autoencoders, which have a continuous latent space, VQ-VAEs have a discrete latent space. This type of autoencoder uses a codebook in the quantization process, which makes the latent space discrete (van den Oord et al., 2017). VQ-VAEs are discussed more in-depth in the related works chapter.

2.3.3 Generative Adversarial Networks

As previously mentioned, the Generative Adversarial Network was first developed by Goodfellow et al. (2014). The architecture of a GAN comprises two neural networks that compete with each other. The first network is called a generator, denoted by G, which aims to generate convincing samples that can deceive the second network. The second network is known as the discriminative network, denoted by D, which aims to differentiate between generated samples and real data (Goodfellow et al., 2014). The GAN architecture is visualised in Figure 3.

Training the model’s two separate networks happens simultaneously (Goodfellow et al., 2014).
Convergence is considered to be reached when the discriminator network’s classification probability approaches 0.5, and the generator network’s probability distribution resembles the distribution of the real data (Goodfellow et al., 2014). The advantages listed by Goodfellow et al. (2014) are mainly computational improvements compared to previous models. A unique aspect of the model that sets it apart from other generative architectures discussed in this study is that the generator network is not directly exposed to embeddings of real data, which, according to Goodfellow et al. (2014), may lead to some statistical advantages.

Figure 3. Generative Adversarial Network architecture depicted by Bengesi et al. (2023).

Generative Adversarial Networks (GANs) have been extensively studied since, as evidenced by Jabbar et al.'s 2020 survey Generative Adversarial Networks: Variants, Applications, and Training. Applications include image generation in the form of handwritten font, image blending, texture synthesis, and 3D image synthesis (Jabbar et al., 2020). Other notable applications mentioned by Jabbar et al. (2020) are music generation, video synthesis, and applications in the medical field.

2.3.4 Transformers

Transformers were initially developed by Vaswani et al. (2017) in a study titled Attention Is All You Need. The transformer network consists of encoder and decoder networks, each forming a stack of N identical layers. The fundamental unit of these layers is known as the Attention Head, which is responsible for updating each embedding based on its relationship with surrounding embeddings (Vaswani et al., 2017). In the paper, Vaswani et al. (2017) describe combining several of these attention heads in parallel to create Multi-Head Attention. An overview of the Transformer model architecture is shown in Figure 5.
Even though the Transformer model’s architecture is similar to the previously mentioned Variational Autoencoder architecture, which consists of encoder and decoder networks, there are some interesting differences worth examining a little further. In a VAE, the encoder network is only required during the training phase, and the inference happens by sampling the latent space and using the decoder to decode the sampled embeddings (Kingma & Welling, 2013). However, in a Transformer network, as visible in Figure 5, the encoder is connected to the decoder so that input embeddings serve as context throughout the autoregressive decoding process (Vaswani et al., 2017).

The most interesting and maybe the most ground-breaking contribution of Vaswani et al.’s (2017) study was the attention mechanism, which they refer to as Scaled Dot-Product Attention. In the study, this form of attention is more broadly called self-attention, which means that the attention head computes the representation of each element in a sequence by considering how it relates to every other element in the same sequence (Vaswani et al., 2017). In contrast to Recurrent Neural Networks (RNN), which rely on sequential processing, this attention method allows each position to attend independently to all positions simultaneously, making the model more computationally efficient (Vaswani et al., 2017).

Figure 4. Scaled Dot-Product Attention architecture (Vaswani et al., 2017).

Figure 4 visualises the architecture of Scaled Dot-Product Attention. To better understand the process depicted in Figure 4, it is helpful to go through it step-by-step. Figure 5 shows how input is transformed into input embeddings and enhanced with positional encoding. For each of the embeddings, the process depicted in Figure 4 is performed.
The variable 𝑄 represents a query matrix that is the product of all the embeddings $E_T$ and $W_Q$, and similarly, 𝐾 represents a key matrix that equals $E_T$ multiplied by $W_K$. A matrix multiplication is then calculated between 𝑄 and the transpose of 𝐾 (Vaswani et al., 2017). According to Vaswani et al. (2017), the next step of scaling the result by $1/\sqrt{d_k}$ is what sets this method apart from regular dot-product attention. In the study, Vaswani et al. (2017) describe the theoretical complexity of additive attention and dot-product attention as similar and justify the use of dot-product attention by it being faster in practice because of optimised matrix multiplication routines. However, with sufficiently large $d_k$, additive attention outperformed unscaled dot-product attention, introducing the need for scaling to prevent the vanishing gradient problem in the softmax layer (Vaswani et al., 2017).

The next step is a masking operation that is only done on the decoder side, as shown in Figure 5. During the autoregressive process, this prevents the decoder from attending to future data that otherwise would influence the present prediction (Vaswani et al., 2017). This is achieved by replacing the result of the matrix multiplication between 𝑄 and 𝐾 with negative infinity for all the connections onwards from $E_t$, where 𝑡 is the index of the embedding currently being processed (Vaswani et al., 2017). This adjustment ensures that during the softmax calculation, which produces the weights in the form of a probability distribution, the masked connections are assigned a weight of zero, effectively disregarding their influence (Vaswani et al., 2017). The full context is maintained for the encoder side, where the masking operation is skipped (Vaswani et al., 2017).

The final step in the process illustrated in Figure 4 involves the matrix multiplication of the weight matrix with the value matrix 𝑉, which is the product of $E_T$ and $W_V$.
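The steps described so far (query, key and value projections, scaling, optional masking, softmax, and the final multiplication with the value matrix) can be sketched in plain Python as follows. This is an illustrative sketch rather than the reference implementation of Vaswani et al. (2017); the helper names, the toy embeddings, and the identity projection matrices are assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(E, W_q, W_k, W_v, causal=False):
    """Scaled dot-product attention over the embedding matrix E."""
    Q, K, V = matmul(E, W_q), matmul(E, W_k), matmul(E, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    # Q K^T scaled by 1 / sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    if causal:
        # Decoder-side mask: position t must not attend to positions after t.
        for t, row in enumerate(scores):
            for j in range(t + 1, len(row)):
                row[j] = float("-inf")
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights


# Hypothetical toy input: three embeddings of dimension two.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for illustration only
out, weights = attention(E, I, I, I, causal=True)
```

With the causal mask enabled, the first row of weights places all of its mass on the first position, since every later position is set to negative infinity before the softmax. The last line of the function is the final step described above, the multiplication of the weight matrix with 𝑉.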
This operation yields $\Delta E_T$, representing the direction in which the original embedding $E_T$ should be adjusted. In the Multi-Head Attention model, there are ℎ attention heads, each generating a $\Delta E_T$. These matrices are concatenated to determine the unified direction for adjusting the original embeddings, enhancing the model's capacity to integrate various contextual insights (Vaswani et al., 2017).

The most famous transformer adaptation is probably the Generative Pre-trained Transformer (GPT), which was released in 2018 by OpenAI (Bengesi et al., 2023). It is a large language model aiming to generate human-like text, and multiple versions have been released since (Bengesi et al., 2023). Other notable adaptations of the transformer architecture are Bidirectional Encoder Representations from Transformers (BERT) and Vision Transformer (ViT), which have produced state-of-the-art results in Natural Language Processing (NLP) and in Computer Vision (CV) tasks, respectively (Chitty-Venkata et al., 2023).

Figure 5. Transformer architecture (Vaswani et al., 2017). The figure consists of an encoder on the left and a decoder on the right.

3 Related Work

With the relatively recent increase in computational power and some breakthroughs in the adjacent field of image generation, AI-generated music has started to take increasingly large leaps forward. Some of the most notable papers in the context of this thesis will be discussed in detail below. Common to all these models, in the spirit of this thesis, is the use of raw audio data as input and output for the neural network, along with the incorporation of convolutions. To gain a deeper understanding of AI-generated music, it is essential to explore the key research problems that have shaped the field.

Goel et al. (2022) identify three major challenges encountered by researchers when designing architectures to model waveforms.
The first challenge is maintaining global coherence in the modelled waveform, which requires the neural network to effectively capture long-range dependencies (Goel et al., 2022). This is supported by Dieleman et al. (2018), who state that attaining globally coherent music has proven difficult. The second challenge discussed by Goel et al. (2022) is computational efficiency. High-fidelity audio involves orders of magnitude more input parameters than a simple image, as explained in the introduction chapter. For instance, the models in Dhariwal et al.'s (2020) Jukebox were trained for two to four weeks using up to 512 GPUs, a feat achievable only by the most well-funded research initiatives even by today's standards. The third challenge is sample efficiency, which is closely related to computational efficiency. According to Goel et al. (2022), sample efficiency refers to the model's ability to converge with fewer training samples, thereby enhancing overall computational efficiency. By improving sample efficiency, the model can achieve effective performance with less data, reducing the computational resources and time required for training.

3.1 WaveNet

In 2016, the Google DeepMind research laboratory released WaveNet, a deep neural network model capable of generating music. As described in the 2016 paper by van den Oord et al., the model is autoregressive and probabilistic, meaning that each newly generated audio sample is conditioned on all previous samples. More concretely, the model calculates a conditional probability distribution from which each new output is sampled. All these conditional distributions are multiplied to form the joint probability of the waveform 𝑥, as shown in Equation 1.

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}) \qquad (1)$$

To achieve this functionality, van den Oord et al.
(2016) depict a stack of causal convolution layers which model the conditional probability distribution.

3.1.1 Causal and Dilated Convolutions

One-dimensional causal convolutions, as described by van den Oord et al. (2016), are different from standard 1D convolutions. In causal convolutions, padding of size K-1, where K is the kernel size, is applied asymmetrically to the past or left side of the sequential input data. This practice prevents the model from accessing future information when predicting present data, as shown in Figure 6.

Figure 6. Different types of convolutions. Blue dots depict the input layer, grey marks the hidden layers, and yellow is the output layer.

WaveNet not only utilises causal convolutions but takes it one step further by implementing dilated causal convolutions, where the dilation doubles between layers (van den Oord et al., 2016). This technique expands the model's receptive field, enabling it to capture longer temporal dependencies in audio sequences, which is vital for generating coherent and realistic sound over extended periods. Additionally, it is more computationally efficient and maintains the same input and output size, which is not the case with other convolution techniques like strided convolution that can be used for capturing longer temporal structures.

3.1.2 Non-linear Quantisation

Another notable technique in terms of music generation that van den Oord et al. (2016) discuss is performing a non-linear quantisation of the input signal. Their proposed method involves transforming the input signal using the μ-law companding technique shown in Equation 2, where $-1 < x_t < 1$ and $\mu = 255$, and quantising the data into 256 possible values. As a result, the data has a higher resolution at lower amplitudes due to the logarithmic nature of the transformation (ITU-T, 1988).

$$f(x_t) = \operatorname{sign}(x_t)\, \frac{\ln(1 + \mu \lvert x_t \rvert)}{\ln(1 + \mu)} \qquad (2)$$

To tie it together, van den Oord et al.
(2016) utilise a softmax distribution to predict the likelihood of each of the 256 quantised values. Based on this, the most probable value can be selected. Alternatively, to increase variety, the distribution can be sampled. The benefit of quantisation is clear: it simplifies the representation of audio samples, reducing the data complexity from 16 bits to 8 bits per sample, thereby decreasing the softmax distribution's output space and computational cost.

In contrast to previous studies, van den Oord et al. (2016) decided to use a gated activation function instead of a rectified linear activation function, as it produced better results when modelling audio signals. Their gated activation unit consists of two adjacent convolutional layers, filter and gate, with respective tanh and sigmoid activation functions. In addition, the units contain residual and skip connections, making the model converge faster and allowing the use of deeper networks.

3.2 Jukebox

The next big leap in AI-generated music came in 2020 when OpenAI researchers Dhariwal et al. released the Jukebox model. The research listed its achievements as generating multiple-minute songs that stay coherent and contain singing. The actual model consists of three layered VQ-VAEs, Vector Quantized Variational Autoencoders, which learn to encode the music into embeddings. For inference, Dhariwal et al. (2020) used autoregressive Scalable Transformers. When generating music with the model, it is possible to prime it with text or audio data.

3.2.1 Multi-scale VQ-VAE

The Jukebox model comprises three VQ-VAEs operating on different temporal scales. Each VQ-VAE is trained separately to prevent the model from relying solely on the one that learns the highest temporal resolution (Dhariwal et al., 2020). Each VQ-VAE utilises WaveNet-like 1D convolutions mirrored for the encoder and decoder, intending to increase the model's receptive field (Dhariwal et al., 2020).

In Jukebox, Dhariwal et al.
(2020) describe the quantisation process of a one-dimensional VQ-VAE where an input signal $x = \{x_1, \ldots, x_T\}$ is encoded into a sequence of indices $z = \{z_1, \ldots, z_S\}$, where 𝑇 is the input length, 𝑆 is the count of indices, and 𝑇/𝑆 is the hop length, indicating the level of dimension reduction. During the encoding process, 𝑥 is transformed into latent vectors $h = \{h_1, \ldots, h_S\}$, and each latent vector is mapped to the nearest codebook vector $e_{z_s} \in C = \{e_1, \ldots, e_K\}$, where 𝐾 is the codebook size. This results in the discretisation of the latent space.

For the training, Dhariwal et al. (2020) use a loss sum that consists of three separate loss functions. The first one is a straightforward reconstruction loss calculated between 𝑥 and the decoded codebook vector sequence $\hat{x}$, as described in Equation 3.

$$\mathcal{L}_{\text{reconstruction}} = \frac{1}{T} \sum_{t} \lVert x_t - \hat{x}_t \rVert_2^2 \qquad (3)$$

The other two loss functions are called codebook loss and commit loss. Both utilise the stop gradient (sg) function, which essentially means that in backpropagation, the argument of sg is treated as a constant because its gradient is set to zero. The codebook loss penalises the model when the codebook vector $e_{z_s}$ is far from the encoded vector $h_s$, and in the commit loss, it is the other way around: the model is penalised if the encoded vector $h_s$ is far from the codebook vector $e_{z_s}$. The respective loss functions are detailed in Equations 4 and 5 (Dhariwal et al., 2020).

$$\mathcal{L}_{\text{codebook}} = \frac{1}{S} \sum_{s} \lVert \mathrm{sg}[h_s] - e_{z_s} \rVert_2^2 \qquad (4)$$

$$\mathcal{L}_{\text{commit}} = \frac{1}{S} \sum_{s} \lVert h_s - \mathrm{sg}[e_{z_s}] \rVert_2^2 \qquad (5)$$

In combination, these loss functions produce the total loss function for the model detailed in Equation 6. Dhariwal et al. (2020) state that $\mathcal{L}_{\text{commit}}$ is added so that the model would try to constrain the values of $h_s$ closer to possible $e_{z_s}$ values, and it is said to have a stabilising effect. The weight of the commit loss can be controlled with the 𝛽 value.

$$\mathcal{L} = \mathcal{L}_{\text{reconstruction}} + \mathcal{L}_{\text{codebook}} + \beta \mathcal{L}_{\text{commit}} \qquad (6)$$

In the paper, Dhariwal et al.
(2020) discuss a problem in which the model only learns to reconstruct low frequencies when using sample-level reconstruction loss. The proposed solution uses the Short-Time Fourier Transform (STFT), which helps the model to learn mid-to-high frequencies. The resulting loss is called spectral loss and is defined in Equation 7.

$$\mathcal{L}_{\text{spectral}} = \lVert \mathrm{STFT}(x_t) - \mathrm{STFT}(\hat{x}_t) \rVert_2 \qquad (7)$$

VQ-VAEs have some known problems, one of which is codebook collapse. This means that most of the latent vectors $h_s$ get mapped to only a few of the codebook vectors. To mitigate this problem, Dhariwal et al. (2020) introduce random restarts, which means that if the average usage of a codebook vector falls too low, that vector is replaced with a randomly chosen $h_s$ vector.

3.2.2 Scalable Transformers and Upsampling

In addition to the VQ-VAE neural networks, Jukebox consists of autoregressive transformers for the actual music inference. In the paper, these transformers are divided into a top-level prior and upsamplers called middle and bottom. The prior 𝑝(𝑧) is a joint conditional probability distribution, with each component similar to Equation 1. The complete distribution is defined in Equation 8 (Dhariwal et al., 2020).

$$p(z) = p(z_{\text{top}})\, p(z_{\text{middle}} \mid z_{\text{top}})\, p(z_{\text{bottom}} \mid z_{\text{middle}}, z_{\text{top}}) \qquad (8)$$

The approach chosen by Dhariwal et al. (2020) simplifies the prediction task, as it occurs in a discrete space and can be categorised as a classification problem instead of a regression problem. In essence, the transformers predict codebook indices autoregressively. In the model, inference with transformers happens top-down, and each prediction layer consists of the same number of discrete codes that map to shorter and shorter segments in the raw audio domain. Upsamplers are conditioned only with the previous layer’s discrete codes that match the raw audio length of the current layer.
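Because the priors and upsamplers operate on codebook indices, it may help to recall how such discrete codes arise in the quantisation step of Section 3.2.1. The snippet below is a minimal sketch with a hypothetical two-dimensional codebook and made-up latent vectors; it is not Jukebox code. It maps each latent vector to the index of its nearest codebook vector and computes the mean squared distance, which is the forward-pass value shared by the codebook loss and the commit loss (the two differ only in where the stop gradient blocks backpropagation).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(h, codebook[k]))
        for h in latents
    ]


# Hypothetical codebook with K = 3 entries and S = 4 latent vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

indices = quantise(latents, codebook)

# Mean squared distance between each latent vector and its assigned codebook vector.
quantisation_error = sum(
    squared_distance(h, codebook[z]) for h, z in zip(latents, indices)
) / len(latents)
```

In this toy example, the list indices is the kind of discrete code sequence that a prior would be trained to predict.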
These codes go through a conditioning layer that uses dilated convolutions, similar to WaveNet, leading to an increase in the number of discrete codes after each layer, recreating the lost information and resulting in increased audio resolution (Dhariwal et al., 2020).

To decrease entropy, Dhariwal et al. (2020) encode artist and genre into embedding vectors. These can also be used to direct the model during the generation phase. They also provide the model with the total length of the audio signal and the start and end times of the current segment.

3.3 Summary

Both studies claimed state-of-the-art performance and results at the time of release. Even if this is true, it does not necessarily provide a complete understanding of the models' actual capabilities. Therefore, it is valuable to analyse both models in light of the challenges highlighted by Goel et al. (2022). One general concern with this line of research is the difficulty of steering the generation in a desired direction. Compared to models that use prompt-based sequence-to-sequence generation, like MusicLM (Agostinelli et al., 2023), steering is harder, especially towards novel directions for which no existing audio is available to be used as a prior. Also, it is important to note that although WaveNet possesses music generation capabilities, which are discussed in the study, the main focus of the study was on text-to-speech generation.

The first challenge brought up by Goel et al. (2022) is the requirement for globally coherent generation. Jukebox’s approach clearly enables it to obtain some longer-range structure and maintain it throughout the generated sample, but at the cost of the audio's fidelity. This observation is aligned with previous research, as Dieleman et al.
(2018) noted that using hierarchical autoregressive inference led to improved long-range structure and decreased signal quality, indicating a trade-off. Similarly, WaveNet's application to music generation demonstrated the importance of a large receptive field to produce samples that sounded musical (van den Oord et al., 2016). Despite this, the models struggled with long-range consistency, resulting in second-to-second variations in genre, instrumentation, and volume (van den Oord et al., 2016). However, the generated samples were often harmonic and aesthetically pleasing, particularly when conditional models were used to control specific aspects of the output based on tags like genre or instruments (van den Oord et al., 2016).

Assessing Jukebox against the second challenge raised by Goel et al. (2022) reveals that computational efficiency is still a problem when training neural networks to model audio waveforms. In their paper, Dhariwal et al. (2020) mention four different training runs: the two upsampler networks were trained with 128 GPUs for 2 weeks, the top-level prior training took 4 weeks with 512 GPUs, and lastly, the lyrics conditioning training was performed with 128 GPUs for a total of 2 weeks. These numbers highlight significant monetary and time constraints for conducting research on this subject. The cost estimate for the GPUs alone amounts to millions of dollars, approximately 6.4 million USD, and does not include other necessary hardware expenses for building a system capable of training these networks.

The computational demands of using the original WaveNet model are significant due to its autoregressive nature, which requires generating audio one sample at a time (van den Oord et al., 2016). While the original WaveNet paper by van den Oord et al.
(2016) does not discuss the hardware or training durations in detail, additional insights can be obtained from the 2017 paper "Parallel WaveNet: Fast High-Fidelity Speech Synthesis" by van den Oord et al. (2017). In this paper, the authors state that the re-engineered WaveNet's inference was 1,000 times faster than the original, with the ability to generate one second of audio in just 50 milliseconds (van den Oord et al., 2017). This implies that the original model took approximately 50 seconds (1,000 × 50 ms) to generate one second of audio. Given this generation time, and the fact that further research was conducted to improve it, it can be deduced that even if the training of the original WaveNet was efficient, the total time required to use it made it computationally inefficient, and it therefore did not overcome the second challenge defined by Goel et al. (2022).

The last challenge mentioned by Goel et al. (2022) is sample efficiency, closely intertwined with computational efficiency as both affect the model’s performance. Sample efficiency aims to achieve better performance through inductive biases (Goel et al., 2022). This means that the model’s learning can be enhanced by making the right design and architectural choices (Hüllermeier et al., 2013, p. 1018). In the case of WaveNet, an example is its autoregressive inference, which assumes that each new sample depends on the previous ones (van den Oord et al., 2016). For humans, it is intuitive that in music, each note depends on the previous ones, but for machines, this connection can be challenging to learn without the aid of inductive bias. Similarly, Jukebox employs a hierarchical VQ-VAE architecture, which introduces inductive biases by modelling music at multiple levels of abstraction. This approach allows Jukebox to efficiently capture both the long-term structure and the fine details of music, enhancing its sample efficiency (Dhariwal et al., 2020).
4 Methodology

In this thesis, the chosen methodological approach is a controlled experiment. As the term implies, the objective is to establish a controlled environment where the experiment can be conducted (Järvinen, 2018; Walliman, 2010, p. 11). The fundamental purpose of experimental research is to establish causality: by carefully controlling and manipulating variables, researchers can isolate specific factors and observe their direct impact on outcomes (Järvinen, 2018; Walliman, 2010, p. 103). This approach is used to test hypotheses, validate theories, and contribute to the body of knowledge in a systematic and replicable manner that allows making informed decisions based on empirical evidence.

In Järvinen's 2018 study On Research Methods, two critical requirements for new knowledge are identified: the necessity for rich and applicable knowledge and the need for reliable knowledge. These requirements often conflict. Specifically, when designing research, imposing numerous constraints to enhance reliability can strip the experimental context of factors that connect it to real-world conditions (Järvinen, 2018). Consequently, achieving a balance between these two factors is crucial to obtaining valuable insights from an experiment (Järvinen, 2018).

The basic terminology of controlled experiments includes dependent variables, independent variables, and intervening variables (Järvinen, 2018). The dependent variable(s), as defined by Järvinen (2018), represent the quantitative outcomes of the study, measuring aspects that indicate improvement or deterioration, a definition corroborated by Walliman (2010, p. 11). The independent variable(s) are the parameters controlled by the researcher, with the premise that manipulating these independent variables should result in a measurable change in the dependent variable(s) attributable to the independent variable(s) (Järvinen, 2018; Walliman, 2010, p. 11).
The intervening variables are those that are not under the control of the researcher but still affect the dependent variable and hence cannot be excluded from the research (Järvinen, 2018). The variables that do not fall into these categories are commonly referred to as unknown variables, which are assumed not to have an impact on the research (Järvinen, 2018).

Experimental research can be enhanced with a control experiment in addition to the main experiment to increase the certainty that these unknown variables do not affect the result (Järvinen, 2018). This is done to confirm that the result truly was caused by the manipulation of the independent variables and not by unknown variables or other factors (Järvinen, 2018). Two additional criteria, listed by Järvinen (2018), that can be evaluated to gain more proof of causality are association (or relationship) and temporal precedence. The association criterion relies on proving the existence of covariance between the independent and dependent variables, i.e., a change in the independent variable produces a reliable and observable change in the dependent variable (Järvinen, 2018). Temporal precedence means that, for two things to be causally linked, the change in the dependent variable must always happen after the change in the independent variable (Järvinen, 2018).

The practical implementation of this research involved conducting a controlled experiment to test whether a convolutional architecture could improve an autoencoder's ability to learn audio data representation. The process began with building a baseline model, a fully connected autoencoder with sufficient performance for comparison. Following this, a convolutional autoencoder was created to evaluate the effects of convolutional layers on the model's performance.
Both models were trained from scratch until convergence, and the results were analysed to assess differences in performance, model size, and output quality.

5 Experiment Design and Model Evolution

This research is focused on testing a hypothesis through a controlled experiment. It involves comparing two models to determine whether convolution can improve the model's ability to learn audio data representation. The first step was to create a prototype model with a decent baseline performance, against which the convolution model could be compared. The second step was to create a convolution model, and the third was to analyse the differences in performance, size, and quality of the result.

The initial idea was to study the effects of multidimensional convolution on AI-generated music. The prototyping phase turned out to be the most time-consuming part of the thesis work. Due to time constraints, the research scope had to be adjusted as the prototyping phase continued. The experiment also faced constraints imposed by the system used for testing, which led to the decision to omit the generative functionality from the neural network. As previously described, the reduced scope focuses on the performance differences between a fully connected Autoencoder network and a 1D convolution Autoencoder network.

The initial plan was to acquire a raw audio dataset and do minimal preprocessing before feeding the raw audio data into a generative neural network. Once the initial model was ready, the goal was to experiment with the input data's dimensionality and with convolutions to see whether that would improve the long-term consistency of the musical structure. In the planning phase, a Variational Autoencoder was selected as the generative neural network type. The initial model was built, but during training, the validation metrics showed that the model could not learn a meaningful representation of the data.
This sparked a prototyping process that followed a continuous feedback loop with three key phases. First, the model underwent training and testing. Next, the results were analysed. Finally, based on these insights, the model and/or the preprocessing techniques were refined and updated, leading to the next iteration of training and testing. Numerous preprocessing methods were tried during the prototyping process, such as μ-law companding and quantisation, as described by van den Oord et al. (2016). Throughout the process, the original model structure was adjusted to account for changes in the dimensionality of the input data. Initially, the 1D sequential audio data was converted into a 2D image-like format. In later prototyping iterations, it was further transformed into a 3D video-like format. As previously mentioned, because no model capable of learning meaningful data representations could be constructed within the constraints, the decision was made to use an Autoencoder neural network instead of a Variational Autoencoder network.

5.1 Dataset and System

This research was performed on Faraldo's (2017) Beatport EDM Key Dataset, which consists of 1486 songs in the Electronic Dance Music (EDM) genre. The data was divided into training, validation, and test sets, such that the training set consisted of 1300 songs, the validation set of 130 songs, and the test set of 20 songs. These songs were split into segments. Different segment lengths were tested, and a 3-second segment length was selected. The audio data has a sampling rate of 44100 Hz, meaning there are 44100 data points per second. To make this manageable, the audio was downsampled by a factor of ten, resulting in 4410 data points per second. Combined with the selected 3-second segment length, this results in an input length of 13230 for the neural network.

As previously referenced, the importance of computational resources cannot be overemphasised.
The prototyping of the models and the actual experiments were performed on a PC with an Intel i7-4770K processor and 24 GB of DDR3 RAM. Neural network training was performed on a dedicated GPU, an Nvidia GeForce GTX 1080, which has 8 GB of dedicated VRAM and an additional 12 GB of shared GPU memory, for a total of 20 GB of memory. The bottleneck in the process turned out to be the GPU memory. The main limitation of GPU memory is its impact on the size of the neural network it can handle. In this context, size is closely related to complexity, as increasing the number of layers or the size of each layer in the neural network increases its memory usage. During prototyping, a model with three fully connected layers containing 9800, 6500, and 3300 neurons was found to occupy over 10 GB when loaded into the GPU memory.

5.2 Model Evolution

This research resulted in the development of two neural networks, each designed to capture specific representations of audio data. Additionally, this section outlines the preprocessing steps necessary for the success of these neural networks. As discussed earlier, the model development process followed a prototyping approach, which created a feedback loop. This iterative process led to several unsuccessful combinations of preprocessing techniques and generative neural network models.

As the project timeline became more constrained, a strategic decision was made to narrow the scope by excluding the generative neural network. Following this decision, many previously developed preprocessing methods were re-evaluated using a standard Autoencoder instead of a Variational Autoencoder. However, the performance outcomes were unsatisfactory, suggesting that the primary limiting factor in this study may have been the system's computational capacity in relation to network size.

5.2.1 Preprocessing

The preprocessing in this study can be divided into two categories: non-transformative and transformative methods.
Non-transformative methods modify the data's dimensions without altering its internal structure. Examples of such preprocessing steps include data segmentation and sampling, which were applied consistently throughout the prototyping process. In the sampling step, the original 44.1 kHz signal was downsampled to 4.41 kHz, reducing the number of data points by a factor of 10. Similarly, in the data segmentation step, the downsampled 2-minute-long songs were divided into more manageable 3-second chunks. These methods were introduced primarily to reduce the computational load on the system, as processing large datasets in their original form would have been resource-intensive. Other preprocessing methods that fall into the first category and align with the study's initial plan were modifications to the input data dimensions.

In contrast, transformative methods involve altering the data's internal relationships or converting it from one coordinate system to another. Initially, the plan did not include any data-transforming preprocessing, as the goal was to work with the raw data as much as possible. However, when it became evident that the neural networks could not effectively learn from the raw audio data, transformative preprocessing steps were added to the workflow. These transformations aimed to modify the data to better align with the neural network's learning capabilities. Various transformative preprocessing techniques were tested individually and in combination as part of the iterative prototyping process.

Early attempts at transformative preprocessing involved data normalisation, which is beneficial in some instances, as pointed out by Singh and Singh (2019). Min-max normalisation was selected, with a range of [-1, 1], to align with the sigmoid activation function's output range used by the model at that time. Despite this alignment, no noticeable improvement was observed in the behaviour of the neural networks.
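The non-transformative steps and the min-max normalisation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function names, the synthetic input signal, the use of plain decimation without a low-pass filter, and normalising over the whole array at once are all assumptions made for the example.

```python
import numpy as np

SOURCE_RATE = 44_100      # Hz, original sampling rate stated in the text
DOWNSAMPLE_FACTOR = 10    # stated in the text
SEGMENT_SECONDS = 3       # stated in the text


def downsample(signal: np.ndarray, factor: int = DOWNSAMPLE_FACTOR) -> np.ndarray:
    """Keep every `factor`-th sample (assumption: plain decimation, no filtering)."""
    return signal[::factor]


def segment(signal: np.ndarray, segment_len: int) -> np.ndarray:
    """Cut a 1D signal into equal-length rows, dropping any incomplete tail."""
    n_segments = len(signal) // segment_len
    return signal[: n_segments * segment_len].reshape(n_segments, segment_len)


def min_max_normalise(x: np.ndarray, low: float = -1.0, high: float = 1.0) -> np.ndarray:
    """Linearly rescale x so its minimum maps to `low` and its maximum to `high`."""
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:  # constant input: avoid division by zero
        return np.full_like(x, (low + high) / 2.0, dtype=float)
    return low + (x - x_min) * (high - low) / (x_max - x_min)


target_rate = SOURCE_RATE // DOWNSAMPLE_FACTOR   # 4410 samples per second
segment_len = target_rate * SEGMENT_SECONDS      # 13230 samples per segment

# Hypothetical stand-in for a decoded 2-minute mono track.
rng = np.random.default_rng(0)
song = rng.uniform(-0.5, 0.5, size=SOURCE_RATE * 120)

segments = segment(downsample(song), segment_len)
normalised = min_max_normalise(segments)
print(segments.shape)  # (40, 13230)
```

Under these assumptions, a 2-minute track yields 40 segments of 13230 samples each, which matches the input length of the networks described in this chapter.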
    import numpy as np

    def mu_law_companded(x, mu=255):
        # Ensure the input is in the range [-1, 1]
        x = np.clip(x, -1, 1)
        # Apply µ-law companding
        x_mu = np.sign(x) * (np.log1p(mu * np.abs(x)) / np.log1p(mu))
        return x_mu

Subsequent preprocessing attempts utilised μ-law companding, as described by van den Oord et al. (2016). The code snippet above demonstrates how the transformation was applied: the input data was first clipped to the range [-1, 1], and the μ-law companding transformation was then applied. Clipping was necessary for this process, as indicated in Equation 2. When examining Waveform 1 in Figure 7, it is clear that most amplitudes already lay within the range [-1, 1], making clipping an effective solution. Listening tests comparing Waveform 1 and Waveform 3 revealed no significant auditory differences, even though clipping resulted in minimal data loss. This indicates that clipping did not significantly affect the quality of the processed audio.

    def mu_law_decoding(y, mu=255):
        # Apply inverse µ-law decoding
        x = np.sign(y) * (1.0 / mu) * (np.power(1 + mu, np.abs(y)) - 1)
        return x

The code snippet above shows how the μ-law companding transformation was reversed, and Figure 7 shows the effects of μ-law companding on a waveform. The first plot shows Waveform 1, the validation waveform, without any modifications. The second plot displays Waveform 2, which is Waveform 1 after applying μ-law companding with the mu_law_companded function. The third plot shows Waveform 3, which is Waveform 2 after reversing the companding with the mu_law_decoding function. However, feeding the neural network with μ-law companded data did not lead to any performance improvements.

Figure 7. Effects of μ-law companding on a waveform.

Van den Oord et al. (2016) utilised μ-law companding along with quantisation, prompting the consideration of incorporating this technique in the preprocessing phase.
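For context only, the quantisation step that van den Oord et al. (2016) combine with μ-law companding can be sketched as below. With μ = 255, the companded signal is mapped to 256 integer classes. The function names and the rounding scheme are assumptions made for this illustration, and the sketch is not part of the preprocessing pipeline finally used in this study.

```python
import numpy as np


def mu_law_quantise(x: np.ndarray, mu: int = 255) -> np.ndarray:
    """Compand x (clipped to [-1, 1]) and map it to integer classes 0..mu."""
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Shift [-1, 1] to [0, mu] and round to the nearest integer class.
    return np.floor((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)


def mu_law_dequantise(q: np.ndarray, mu: int = 255) -> np.ndarray:
    """Map integer classes back to [-1, 1] and invert the companding."""
    y = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu


# Hypothetical test signal spanning the full amplitude range.
signal = np.linspace(-1.0, 1.0, 1001)
classes = mu_law_quantise(signal)
restored = mu_law_dequantise(classes)
max_error = float(np.max(np.abs(signal - restored)))
print(classes.min(), classes.max())
```

Each sample becomes one of 256 class labels, and the round trip is lossy: the reconstruction error is smallest near zero amplitude and largest near the extremes, which reflects how μ-law companding allocates resolution.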
This method involves transforming the problem from a regression problem to a classification problem. However, as this was not the preprocessing method ultimately chosen for this study, a detailed explanation would divert focus from the current work. It is therefore mentioned here only for context.

The preprocessing method that ultimately proved effective involved a combination of steps. First, non-transformative techniques such as downsampling and segmentation were applied. Each 3-second segment was then transformed using a Real Fast Fourier Transform (RFFT), converting the data into the frequency domain. After the RFFT, the length of each 3-second segment was approximately halved, because the RFFT of a real-valued signal returns only the non-negative frequency bins, with the resulting array containing complex values of the form real ± imaginary coefficient. In the final step, the array size was doubled back to its original length by splitting each value into separate real and imaginary components. The final array alternated between real and imaginary coefficients, structured as [real_value, img_coeff, real_value, img_coeff, …], making the data suitable for input into the neural network. It is interesting to note that the length of the segment, when combined with the RFFT, impacts the neural networks' capacity to learn. When the order of segmentation and RFFT was reversed, so that the complete 2-minute song was taken through the RFFT and only then segmented into "3-second" segments, the network was unable to capture a meaningful representation of the data.

5.2.2 Model development

This study produced two distinct Autoencoder neural network architectures: the base model and the convolutional model. The base model was created as a reference point for comparing the performance of the convolutional model.

The base model consists of 5 fully connected layers, similar to the model shown in Figure 4.
It comprises an encoder side (depicted in blue in Figure 4) and a decoder side (depicted in green in Figure 4). In the middle, there is a bottleneck layer that determines the minimum dimensionality through which the data is passed. This layer can significantly impact the model's ability to learn a representation of the data, as having too few nodes can make it impossible to capture all the important attributes of the data. In a network layer, there are input and output nodes, which are similar to the two bottommost layers shown in Figure 1. The connections are as depicted in Figure 1, where each node is connected to every node in the following layer, hence the name fully connected.

The preprocessing creates data blocks of 13230 attributes, which determines the input size of the first layer. The two layers in the encoder reduce the attributes first to 9800 and then to 6500. The bottleneck layer, also known as the latent space or latent vector, further reduces the dimensionality to 3300 attributes. The decoder mirrors the encoder and upscales the data back to 13230 in 2 layers. This model uses the Rectified Linear Unit (ReLU) activation function: each layer's output is passed through the activation function, except for the decoder's final layer.

The convolutional model, as described by Goodfellow et al. (2016, p. 326), is a type of neural network where one or more layers are replaced with convolutional layers. In this research, the model consists of 8 layers, of which six are convolutional and two are linear (fully connected). The model design is shown in Figure 8 and Figure 9. Figure 8 illustrates the encoder side of the Autoencoder design, with each block representing the interface between layers. The numbers above the blocks represent the channels in each interface. For example, the first layer takes data in one channel, but after the first convolution, the data is channelled into 16 separate channels.
This increase in the number of channels is represented by the increased thickness of the blocks in the figures. Additionally, each convolution reduces the input length for the next layer, which is visualised by the decrease in the "area" of each block.

Figure 8. The encoder part of the Convolutional Autoencoder is in blue, and the first latent vector is in white.

The code snippet below illustrates how the input length decreases as it passes through the encoder side. In the code, 40 represents the batch size, while 16, 32, and 64 represent the channel counts, and the last number represents the input length. An interesting observation from the code is that each time the channel count doubles, the input length is halved. The halving results from the kernel attributes chosen for the convolutions, in particular the stride of 2. Each convolution layer uses the same kernel attributes, which are size: 4, stride: 2, and padding: 1. The "flatten" operation is essential for transforming the 2D data (channels × length) back to a 1D form, allowing it to be fed to the linear bottleneck layer.

    Encoder input: torch.Size([40, 1, 13230])
    Conv1:         torch.Size([40, 16, 6615])
    Conv2:         torch.Size([40, 32, 3307])
    Conv3:         torch.Size([40, 64, 1653])
    Flatten:       torch.Size([40, 105792])
    Bottleneck:    torch.Size([40, 3300])

In Figure 9, we can see that the decoder side of the autoencoder mirrors the encoder side. One key difference is that the decoder's convolutional layers use transpose convolution operations to reverse the effects of the convolution operations. In the decoder, the input length is doubled each time the channels are halved. However, this alone is insufficient to reach the original data dimensionality, as shown in the code snippet below. To address this mismatch, the decoder side uses a fourth kernel attribute called output padding, which increases the dimensions of the output by one. Similar to the base model, all the layers in the autoencoder go through the ReLU activation function except for the decoder's last layer.
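The length arithmetic behind these shapes can be checked with the standard output-length formulas for strided and transposed 1D convolutions (dilation 1). The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, and the per-layer output padding values are derived here from the shapes reported in this section rather than taken from the actual implementation.

```python
def conv1d_out_len(length: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def conv_transpose1d_out_len(length: int, kernel: int = 4, stride: int = 2,
                             padding: int = 1, output_padding: int = 0) -> int:
    """Output length of a 1D transposed convolution with dilation 1."""
    return (length - 1) * stride - 2 * padding + kernel + output_padding


# Encoder: three convolutions with kernel 4, stride 2, padding 1.
encoder_lengths = [13230]
for _ in range(3):
    encoder_lengths.append(conv1d_out_len(encoder_lengths[-1]))

# Flattened size entering the bottleneck: 64 channels times the final length.
flattened = 64 * encoder_lengths[-1]

# Decoder: find the output padding each transposed convolution would need
# in order to retrace the encoder lengths in reverse.
targets = encoder_lengths[::-1]
output_paddings = []
for current, target in zip(targets, targets[1:]):
    base = conv_transpose1d_out_len(current)  # length without output padding
    output_paddings.append(target - base)

print(encoder_lengths)   # [13230, 6615, 3307, 1653]
print(flattened)         # 105792
print(output_paddings)   # [1, 1, 0]
```

The integer division in the convolution formula discards a sample whenever the incoming length is odd, which is why a transposed convolution with the same kernel attributes falls one sample short in the first two decoder steps. In this sketch an output padding of one is needed there, while the final step already lands on 13230.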
    Decoder input: torch.Size([40, 3300])
    Decoder fc:    torch.Size([40, 105792])
    Reshape:       torch.Size([40, 64, 1653])
    Conv3:         torch.Size([40, 32, 3307])
    Conv2:         torch.Size([40, 16, 6615])
    Conv1:         torch.Size([40, 1, 13230])

Figure 9. The decoder part of the Convolutional Autoencoder is in green, and the second latent vector is in white.

5.2.3 Testing setup

As previously established, this test evaluates the difference between linear and convolutional neural network architectures. Both networks are trained from scratch, and training is set to last until convergence or until a maximum of 500 epochs is reached. In this study, convergence is considered to be achieved when there are 50 consecutive epochs with no improvement in validation loss. The testing setup consists of multiple variables, which in the terminology of controlled experiments can fall under the category of either unknown variables or intervening variables. To limit the number of variables that fall into either category, all other variables are kept constant between the test runs. These variables include system settings, dataset composition, and training setup. Section 5.1 discusses system settings and the dataset in detail, so this section elaborates on the testing setup.

Two parts that critically affect the model's ability to learn are the objective function and the optimization procedure, as explained in Chapter 2. In this study, the Mean Squared Error (MSE) function, presented in Equation 3, is employed as the objective function in alignment with established literature, given the regression nature of the problem (James et al., 2023, p. 28). Building upon the optimization concepts discussed in Chapter 2, this study utilizes the Adam optimizer for training the neural network model. Adam, short for Adaptive Moment Estimation, is an extension of stochastic gradient descent that computes adaptive learning rates for each parameter (Kingma & Ba, 2015).
The choice of Adam is motivated by its efficiency and effectiveness in handling sparse gradients and noisy data, which are common in real-world datasets. The optimizer is configured with a learning rate of 1𝑥10$H and a weight decay of 1𝑥10$I. A low learning rate helps the model converge smoothly without overshooting the minimum. The weight decay term is a regularization parameter that penalizes large weights to prevent overfitting and improve the model's generalization capabilities (Goodfellow et al., 2016, p. 229).

5.3 Results

Following the principles of a controlled experiment, this study established one independent variable and seven dependent variables. The independent variable in this study can be broadly described as the neural network's architecture. In controlled experiments, dependent variables are those expected to be affected by changes in the independent variables. They can hence be used to measure results, provided it is also accepted that an improvement in a dependent variable corresponds to a qualitative improvement in the output data. The dependent variables can be divided into four error calculations, two graphical visualizations, and one subjective listening review.

The selected error metrics were Mean Squared Error, Mean Absolute Error, Root Mean Squared Error, and Signal-to-Noise Ratio (SNR). As mentioned earlier, Mean Squared Error (MSE) was chosen as the validation loss function, given its effectiveness in regression tasks. In addition, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) were computed to give further insights into model performance. MAE assesses the average size of the errors in predictions, providing a clear view of how far predictions deviate from actual values in the original scale, which makes it easier to interpret. RMSE, which is derived by taking the square root of MSE, is more responsive to larger errors, giving insight into possible outliers in predictions.
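The four error metrics named above can be sketched with NumPy as follows. The MSE, MAE, and RMSE functions follow their standard definitions. The SNR function assumes the common definition of signal power over error power, expressed in decibels, which may differ from the exact formulation used in the experiments. The function names and the toy arrays are hypothetical.

```python
import numpy as np


def mse(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean of the squared differences between target and reconstruction."""
    return float(np.mean((y - y_hat) ** 2))


def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean of the absolute differences between target and reconstruction."""
    return float(np.mean(np.abs(y - y_hat)))


def rmse(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Square root of the MSE, expressed in the units of the data."""
    return float(np.sqrt(mse(y, y_hat)))


def snr_db(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Assumed definition: 10 * log10(signal power / error power)."""
    signal_power = np.sum(y ** 2)
    noise_power = np.sum((y - y_hat) ** 2)
    return float(10.0 * np.log10(signal_power / noise_power))


# Toy example: one of four samples is off by 2.
target = np.array([1.0, 2.0, 3.0, 4.0])
reconstruction = np.array([1.0, 2.0, 3.0, 2.0])

print(mse(target, reconstruction))   # 1.0
print(mae(target, reconstruction))   # 0.5
print(rmse(target, reconstruction))  # 1.0
print(snr_db(target, reconstruction))
```

In the toy example a single large error raises the MSE and RMSE to 1.0 while the MAE stays at 0.5, which illustrates why the squared metrics respond more strongly to outliers.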
Signal-to-Noise Ratio was also employed as a metric to measure the clarity of the reconstructed audio compared to the original. SNR assesses the level of the desired signal relative to the background noise, with higher values indicating better reconstructions. Compared to the other error metrics, SNR provides a complementary view of model performance and is particularly suitable for audio processing tasks. The validation process utilized two types of graphs: comparison graphs illustrating both the original and reconstructed waveforms, and magnitude spectrum graphs comparing the frequency content of the validation and reconstructed data. Ultimately, the most sensible way to evaluate musical results is through listening. The listening and the analysis were both conducted by the author.

\[ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right| \tag{9} \]

\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{10} \]

This study includes two training runs, one for each model, allowing them to be trained from scratch to assess their performance. An analysis of the results is conducted after both training runs are complete. The values of the dependent variables were automatically recorded at the beginning of the training and during training under two specific conditions: every 25th epoch and whenever the best validation loss improved. Figure 10 illustrates the validation error progression for both models.

Figure 10 gives a good general view of the performance of both models. It can be seen that the convolution model was worse in the beginning, but within a few epochs, it surpassed the base model. It is also notable that the base model reached its minimum at Epoch 56, after which its validation loss slowly started to increase. In contrast, the convolution model's validation loss continued to decline until Epoch 75 and thereafter remained very close to its global minimum.
It is important to highlight that the graph's resolution is highest during the initial epochs and diminishes toward the end of the runs because of the chosen reporting criteria: once the validation loss stops improving, the loss is recorded only at every 25th epoch.

Figure 10. Evolution of the validation loss during the test with global minima. The blue curve represents the base model's performance, and the red curve represents the convolution model's performance.

Tables 1 and 2 give a deeper insight into the development of the dependent variables. It should be noted that epoch numbering starts at zero, and at that point, one training round has already been completed. In the tables, the first three rows reflect the progress of the models after one, five, and ten epochs. The fourth row presents each model's best result, while the remaining rows display all recorded outcomes following those best results. It is worth highlighting some notable findings from the results. Initially, the convolution model had a higher validation loss than the base model. However, after just five epochs, the convolution model had halved its validation loss, whereas the base model had achieved a reduction of only about 34%. Remarkably, it took only ten epochs for the convolution model to surpass the base model's lowest validation loss. By the 10th epoch, the base model reached results similar to what the convolution model had achieved in just half that number of epochs.

Table 1. Summary of changes in the base model's dependent variables.

    Base model    MSE      MAE    RMSE    SNR
    Epoch 0       307.26   0.08   16.75   3.18 dB
    Epoch 4       202.67   0.06   13.44   5.61 dB
    Epoch 9       165.03   0.06   12.26   6.57 dB
    Epoch 56      124.20   0.05   11.09   8.42 dB
    Epoch 75      126.90   0.05   11.25   8.37 dB
    Epoch 100     128.29   0.05   11.37   8.42 dB
    Epoch 125     131.28   0.05   11.54   8.35 dB

Table 2. Summary of changes in the convolution model's dependent variables.
    Convolution model   MSE      MAE    RMSE    SNR
    Epoch 0             342.58   0.11   17.58   1.64 dB
    Epoch 4             167.56   0.06   12.39   6.32 dB
    Epoch 9             120.48   0.05   10.62   8.55 dB
    Epoch 75            91.05    0.04   9.69    10.23 dB
    Epoch 100           92.57    0.04   9.77    10.16 dB
    Epoch 125           91.91    0.04   9.74    10.27 dB

Figures 11 to 20 offer an overview of the training processes for both models. These figures are divided into two groups: Figures 11 to 15 show amplitude waveform data, while Figures 16 to 20 illustrate the magnitude across frequency bins. Each figure represents a notable stage in model progression, providing visual benchmarks of how closely each model approximates the target waveform and frequency distribution over successive epochs.

Figure 11. Waveform of the validation audio.

Figure 12. The base model's reconstruction data is depicted as a waveform after Epoch 0.

Figure 13. The base model's reconstruction data is depicted as a waveform after Epoch 56.

Figure 14. The convolution model's reconstruction data is depicted as a waveform after Epoch 0.

Figure 15. The convolution model's reconstruction data is depicted as a waveform after Epoch 75.

Figure 11 presents a waveform representation of the validation data, which serves as the target reference. Figures 12 and 13 illustrate the initial (Epoch 0) and optimal (Epoch 56) results for the base model. As the model progresses from Figure 12 to Figure 13, notable changes in the waveform appear, particularly the development of distinct amplitude spikes that move closer to the target shape in Figure 11. Figures 14 and 15 similarly track the convolution model's progression, with Figure 14 displaying the initial output (Epoch 0) and Figure 15 presenting the optimal result (Epoch 75). Figure 14 exhibits more waveform volatility than Figure 12 but less than Figure 13.
Interestingly, the convolutional model's mean squared error (MSE) at Epoch 0 was worse than that of the base model, although visually, the waveform in Figure 14 resembles the target validation data in Figure 11 more closely than the base model's Figure 12 does. Figure 15 shows the convolutional model's best result, achieved after Epoch 75, and compared to the earlier figures, it most closely resembles the target waveform in Figure 11. In comparison to the base model's best result shown in Figure 13, the individual amplitude spikes in Figure 15 are slightly stronger, making it more similar to Figure 11. However, Figure 15 is not perfect, lacking some of the mid-level amplitudes between weak and strong that are present in Figure 11.

Figure 16. Frequency distribution and corresponding magnitudes of the validation data.

Figure 17. Frequency distribution and magnitude of the base model's reconstruction data after Epoch 0.

Figure 18. Frequency distribution and magnitude of the base model's reconstruction data after Epoch 56.

Figure 19. Frequency distribution and magnitude of the convolution model's reconstruction data after Epoch 0.

Figure 20. Frequency distribution and magnitude of the convolution model's reconstruction data after Epoch 75.

Figure 16 illustrates the frequency distribution and corresponding magnitudes of the validation data, essentially showing how prominent each frequency range is within the target data. Figure 17, representing the base model's initial reconstruction at Epoch 0, shows only a single major frequency spike; unlike the broader distribution seen in Figure 16, the mass in Figure 17 is concentrated in a narrow range. By Epoch 56, shown in Figure 18, the base model's frequency distribution widens, closely resembling Figure 16, but still missing smaller spikes at frequencies farther from the primary concentration.
Figures 19 and 20 show the convolutional model's progression, starting with Epoch 0 in Figure 19, where the distribution already replicates much of Figure 16, with a wider base and even a secondary spike. However, this figure exhibits a cutoff beyond the 50kth frequency bin, with no data present past that point. Figure 20, showing the convolutional model's best result at Epoch 75, closely resembles Figure 16 with an even sharper cutoff than in Figure 19, beyond which no frequency data appears.

5.4 Analysis

This section focuses on analysing the models' results, considering the reasoning behind their performance and the factors that influenced their behaviour. The analysis is divided into two subsections: common features and separating features. Given the highly controlled environment and specific test conditions, one might expect the models to share more commonalities than differences. Surprisingly, that is not the case.

5.4.1 Common features

Exploring the models' audio reconstructions revealed one notable commonality that deserves further examination: sounds typically associated with higher frequencies, such as melodies or vocals, are absent from the reconstructed audio of both neural network models. This observation raises the question of why these higher-frequency elements were not effectively retained by either model, particularly with regard to the potential role of downsampling and its relation to the Nyquist theorem.

A potential explanation for the lack of melodies could be the effects of downsampling. The Nyquist-Shannon sampling theorem states that the highest frequency that can be accurately captured by a system is half of the sampling rate (Shannon, 1949). For the audio used in this study, which was downsampled to 4410 Hz, the highest frequency that can be reconstructed without aliasing is 2205 Hz.
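As a worked instance of the theorem, with the sampling rate written as f_s (a symbol introduced here for illustration), the limits before and after downsampling are:

```latex
f_{\max} = \frac{f_s}{2}, \qquad
\frac{44100\ \text{Hz}}{2} = 22050\ \text{Hz}, \qquad
\frac{4410\ \text{Hz}}{2} = 2205\ \text{Hz}
```

Downsampling by a factor of ten therefore lowers the highest representable frequency by the same factor.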
However, this explanation alone seems insufficient, as most musical instruments produce fundamental frequencies well below this limit. To contextualize the discussion, it is beneficial to consider the frequency ranges of various musical instruments. For instance, a piano spans from 27.5 Hz to 4186 Hz (A0 to C8), a violin from 196 Hz to approximately 3500 Hz, and a guitar from 82 Hz up to approximately 1 kHz. The human voice typically operates at around 120 Hz for males and approximately 200 Hz for females, though female singers can reach frequencies as high as 1500 Hz (Pulkki & Karjalainen, 2015, p. 82). These examples il