SPOTIFY APP DATA ANALYSIS AND RECOMMENDATION SYSTEM DEVELOPMENT

Ahmed, Mian Shayan

2026-05-14

Master's thesis

Sustainable and Autonomous Systems

Master Thesis Final by Mian 2026.pdf

1.27 MB - Primary

CC BY-NC-ND 4.0

Downloads: 40

Permanent address

https://urn.fi/URN:NBN:fi-fe2026052049875

Description

Full text of the thesis in PDF format.

Abstract: This thesis presents the design, implementation, and evaluation of a hybrid music recommendation system that integrates Collaborative Filtering (CF) with audio-based content features. The system was built on a synthetic dataset generated with the same schema as the Last.fm 1K Users dataset and the Spotify Web API audio features. All quantitative results reported in this thesis are therefore based on synthetic data, and no empirical validation on real users has been conducted; obtaining the real Last.fm dataset and validating on real user data is designated as the primary direction for future work. The research was motivated by the growing challenge users face in discovering relevant music in vast streaming catalogues, and by the structural limitations of single-paradigm approaches: Collaborative Filtering suffers from data sparsity and cold-start problems, while content-based methods tend to over-specialise toward established user preferences, suppressing musical discovery. The primary objectives were to: (1) analyse and preprocess large-scale music interaction data; (2) implement Alternating Least Squares (ALS) optimisation for matrix factorisation (a comparison with stochastic gradient descent, SGD, is designated as future work); (3) engineer a 24-dimensional audio feature vector from Spotify Web API descriptors; (4) implement and compare three recommendation models: Collaborative Filtering (CF), Content-Based (CB), and a Weighted Hybrid (WH); and (5) evaluate all models using a multi-dimensional metric suite. A seven-stage Design Science Research pipeline was followed. The skip-adjusted implicit feedback signal (Equation 3.1) was constructed from the synthetic interaction logs; no real user listening histories were used.
The 24-dimensional audio feature matrix A was engineered from Spotify Web API descriptors and comprises 7 perceptual, 4 dynamic/temporal, 12 tonal (one-hot encoded key), and 1 mode dimension. The CF model was implemented using Alternating Least Squares (ALS) matrix factorisation; the CB model used cosine similarity over the audio feature space; and the Weighted Hybrid linearly interpolated the two predictions with a cross-validated mixing parameter alpha. Technologies employed included Python 3.10, NumPy, SciPy, pandas, scikit-learn, the implicit library for ALS, Spotipy for Spotify Web API access, Streamlit for the web application, and Matplotlib/Seaborn for visualisation. Evaluation across five metrics (Precision@10, Recall@10, NDCG@10, ILD@10, Novelty@10) with 95% bootstrap confidence intervals, conducted entirely on synthetic data, revealed a clear accuracy-diversity trade-off: the Content-Based model achieved the highest ranking accuracy (NDCG@10 = 0.0330, Precision@10 = 0.0367), while the Collaborative Filtering model achieved substantially higher intra-list diversity (ILD@10 = 0.9714) and novelty (Novelty@10 = 3.6411). The confidence intervals for NDCG@10 of CB and CF did not overlap, indicating a statistically significant difference on the synthetic dataset. The study concludes that no single recommendation paradigm simultaneously optimises all quality dimensions, which supports the theoretical justification for hybrid architectures. It must be emphasised that all reported quantitative results are based entirely on synthetic data; no empirical validation on real users exists in this thesis. The Weighted Hybrid with its automatic cold-start fallback (alpha = 0 for new items) provides a principled solution to the item cold-start problem. A fully reproducible nine-file Python pipeline and an interactive Streamlit web application are delivered as the principal artefacts.
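The cosine-similarity scoring and the weighted interpolation with a cold-start fallback described in the abstract could be sketched roughly as follows. This is an illustrative sketch only, not the thesis code: the function names `cosine_scores` and `hybrid_scores`, the `item_is_new` flag, and the convention that alpha weights the CF term (so that alpha = 0 falls back to the content-based score) are assumptions made for this example.

```python
import numpy as np

def cosine_scores(profile, A):
    """Cosine similarity between a user profile vector and each row of the
    item feature matrix A (hypothetical helper, not from the thesis)."""
    A = np.asarray(A, dtype=float)
    profile = np.asarray(profile, dtype=float)
    denom = np.linalg.norm(A, axis=1) * np.linalg.norm(profile)
    # Guard against zero-length vectors to avoid division by zero
    return (A @ profile) / np.where(denom == 0, 1.0, denom)

def hybrid_scores(cf_scores, cb_scores, item_is_new, alpha=0.5):
    """Linear blend of CF and CB scores per item.

    Assumption: alpha weights the CF term. Items flagged as new have no
    interaction history, so their weight is forced to 0 and the score
    comes entirely from the content-based model.
    """
    cf = np.asarray(cf_scores, dtype=float)
    cb = np.asarray(cb_scores, dtype=float)
    new = np.asarray(item_is_new, dtype=bool)
    alpha_vec = np.where(new, 0.0, alpha)
    # Replace undefined CF scores of new items so NaN cannot propagate
    cf = np.where(new, 0.0, cf)
    return alpha_vec * cf + (1.0 - alpha_vec) * cb
```

In this sketch the third item in a call such as `hybrid_scores([0.8, 0.2, np.nan], [0.4, 0.6, 0.9], [False, False, True])` is scored purely from its audio features, which is how a fallback of this kind would keep newly added tracks recommendable.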
Future work includes obtaining the real Last.fm dataset and conducting empirical validation on real user data, implementing the Feature-Augmented Hybrid architecture, and conducting a longitudinal user study.

KEYWORDS: music recommendation, collaborative filtering, content-based filtering, hybrid systems, matrix factorisation, audio features, Spotify, machine learning, implicit feedback, beyond-accuracy evaluation
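For readers unfamiliar with the accuracy and beyond-accuracy metrics named in the abstract, the sketch below shows commonly used formulations of Precision@k, NDCG@k, intra-list diversity (ILD@k), and Novelty@k. The exact definitions used in the thesis are not given on this page, so the choices here are assumptions: binary relevance for NDCG, mean pairwise cosine distance for ILD, and mean self-information of item popularity for novelty. All function and argument names are hypothetical.

```python
import numpy as np

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for i in ranked[:k] if i in relevant) / k

def ndcg_at_k(ranked, relevant, k=10):
    """NDCG with binary gains and a log2 position discount."""
    gains = [1.0 if i in relevant else 0.0 for i in ranked[:k]]
    dcg = sum(g / np.log2(pos + 2) for pos, g in enumerate(gains))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def ild_at_k(ranked, features, k=10):
    """Mean pairwise cosine distance between the top-k items' feature vectors."""
    X = np.asarray([features[i] for i in ranked[:k]], dtype=float)
    n = len(X)
    if n < 2:
        return 0.0
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    sim = X @ X.T
    upper = np.triu_indices(n, k=1)  # each unordered pair counted once
    return float(np.mean(1.0 - sim[upper]))

def novelty_at_k(ranked, popularity, k=10):
    """Mean self-information, -log2(p), of the top-k items, where
    popularity[i] is the assumed share of users who interacted with item i."""
    return float(np.mean([-np.log2(popularity[i]) for i in ranked[:k]]))
```

Under these definitions a list of popular, acoustically similar tracks can score well on Precision@k and NDCG@k while scoring poorly on ILD@k and Novelty@k, which is the kind of trade-off the abstract reports.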

recommender systems; users; music; neural networks (information technology); applications (computer programmes); data mining; musical objects; participatory planning; evaluation; live action role-playing games
