SPOTIFY APP DATA ANALYSIS AND RECOMMENDATION SYSTEM DEVELOPMENT

Ahmed, Mian Shayan

dc.contributor.author	Ahmed, Mian Shayan
dc.contributor.faculty	fi=Tekniikan ja innovaatiojohtamisen yksikkö|en=School of Technology and Innovations|
dc.contributor.organization	fi=Vaasan yliopisto|en=University of Vaasa|
dc.date.accessioned	2026-06-08T13:44:20Z
dc.date.issued	2026-05-14
dc.description.abstract	This thesis presents the design, implementation, and evaluation of a hybrid music recommendation system that integrates Collaborative Filtering (CF) with audio-based content features. The system was designed and implemented using a synthetic dataset generated with the same schema as the Last.fm 1K Users dataset and Spotify Web API audio features; all quantitative results reported in this thesis are based on synthetic data, and no empirical validation on real users has been conducted. Obtaining the real Last.fm dataset and validating on real user data is designated as the primary direction for future work. The research was motivated by the growing challenge users face in discovering relevant music from vast streaming catalogues, and by the structural limitations of single-paradigm approaches: Collaborative Filtering suffers from data sparsity and cold-start problems, while content-based methods tend to over-specialise toward established user preferences, suppressing musical discovery. The primary objectives were to: (1) analyse and preprocess large-scale music interaction data; (2) implement ALS optimisation for matrix factorisation (SGD comparison is designated as future work); (3) engineer a 24-dimensional audio feature vector from Spotify Web API descriptors; (4) implement and compare three recommendation models: Collaborative Filtering (CF), Content-Based (CB), and a Weighted Hybrid (WH); and (5) evaluate all models using a multi-dimensional metric suite. A seven-stage Design Science Research pipeline was followed. The skip-adjusted implicit feedback signal (Equation 3.1) was constructed from the synthetic interaction logs; no real user listening histories were used.
The 24-dimensional audio feature matrix A was engineered from Spotify Web API descriptors comprising 7 perceptual, 4 dynamic/temporal, 12 tonal (one-hot encoded key), and 1 mode dimension. The CF model was implemented using Alternating Least Squares (ALS) matrix factorisation; the CB model used cosine similarity over the audio feature space; and the Weighted Hybrid linearly interpolated the two predictions with a cross-validated mixing parameter alpha. Technologies employed included Python 3.10, NumPy, SciPy, pandas, scikit-learn, the implicit library for ALS, Spotipy for Spotify Web API access, Streamlit for the web application, and Matplotlib/Seaborn for visualisation. Evaluation across five metrics (Precision@10, Recall@10, NDCG@10, ILD@10, Novelty@10) with 95% bootstrap confidence intervals, all conducted on synthetic data with no empirical validation on real users, revealed a clear accuracy-diversity trade-off: the Content-Based model achieved the highest ranking accuracy (NDCG@10 = 0.0330, Precision@10 = 0.0367), while the Collaborative Filtering model achieved substantially higher intra-list diversity (ILD@10 = 0.9714) and novelty (Novelty@10 = 3.6411). Confidence intervals for NDCG@10 between CB and CF did not overlap, indicating a statistically significant difference on the synthetic dataset. The study concludes that no single recommendation paradigm simultaneously optimises all quality dimensions, which supports the theoretical justification for hybrid architectures. It must be emphasised that all reported quantitative results are based entirely on synthetic data; no empirical validation on real users exists in this thesis. The Weighted Hybrid with its automatic cold-start fallback (alpha = 0 for new items) provides a principled solution to the item cold-start problem. A fully reproducible nine-file Python pipeline and an interactive Streamlit web application are delivered as the principal artefacts.
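The weighted blend and its cold-start fallback can be illustrated with a minimal sketch. It assumes that alpha weights the CF score and (1 - alpha) weights the CB score, which is consistent with "alpha = 0 for new items" but is not spelled out in the abstract. The function name `hybrid_scores`, the `interaction_counts` argument, and the example numbers are hypothetical and are not taken from the thesis pipeline.

```python
import numpy as np

def hybrid_scores(cf_scores, cb_scores, interaction_counts, alpha=0.5):
    """Linearly interpolate CF and CB scores per item.

    Assumption: alpha weights the CF score. Items with no recorded
    interactions get alpha = 0, so they are ranked by content alone.
    """
    cf = np.asarray(cf_scores, dtype=float)
    cb = np.asarray(cb_scores, dtype=float)
    counts = np.asarray(interaction_counts)
    # Per-item mixing weight: the global alpha for known items, 0 for new items.
    item_alpha = np.where(counts > 0, alpha, 0.0)
    return item_alpha * cf + (1.0 - item_alpha) * cb

# Hypothetical scores for three items; the third item has no interactions.
cf = [0.8, 0.2, 0.0]
cb = [0.4, 0.6, 0.9]
counts = [12, 3, 0]
scores = hybrid_scores(cf, cb, counts, alpha=0.5)
```

In this sketch the third item keeps its content-based score unchanged, since there is no collaborative signal to blend in. In the thesis alpha is chosen by cross-validation; the value 0.5 here is only a placeholder.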
Future work includes obtaining the real Last.fm dataset and conducting empirical validation on real user data, implementing the Feature-Augmented Hybrid architecture, and conducting a longitudinal user study. KEYWORDS: music recommendation, collaborative filtering, content-based filtering, hybrid systems, matrix factorisation, audio features, Spotify, machine learning, implicit feedback, beyond-accuracy evaluation
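The 7 + 4 + 12 + 1 layout of the audio feature vector described in the abstract can be sketched as follows. The abstract gives only the group sizes, so the assignment of specific Spotify audio-feature fields to the perceptual and dynamic/temporal groups below is an assumption, as is the function name `audio_feature_vector`. Scaling of fields such as tempo or loudness is omitted.

```python
import numpy as np

# Assumed grouping of Spotify audio-feature fields (not confirmed by the abstract).
PERCEPTUAL = ["danceability", "energy", "speechiness", "acousticness",
              "instrumentalness", "liveness", "valence"]        # 7 dimensions
DYNAMIC = ["loudness", "tempo", "duration_ms", "time_signature"]  # 4 dimensions

def audio_feature_vector(track):
    """Build a 24-dimensional vector from one track's audio-feature dict."""
    perceptual = [float(track[name]) for name in PERCEPTUAL]
    dynamic = [float(track[name]) for name in DYNAMIC]
    # One-hot encode the pitch class (0-11); an undetected key (-1) stays all zeros.
    key_one_hot = np.zeros(12)
    if 0 <= track["key"] <= 11:
        key_one_hot[track["key"]] = 1.0
    mode = [float(track["mode"])]  # 1 = major, 0 = minor
    return np.concatenate([perceptual, dynamic, key_one_hot, mode])

# Hypothetical track record for illustration only.
example_track = {
    "danceability": 0.7, "energy": 0.6, "speechiness": 0.05,
    "acousticness": 0.2, "instrumentalness": 0.0, "liveness": 0.1,
    "valence": 0.5, "loudness": -7.5, "tempo": 120.0,
    "duration_ms": 210000, "time_signature": 4, "key": 5, "mode": 1,
}
vec = audio_feature_vector(example_track)
```

Because the content-based model compares tracks by cosine similarity, the unscaled `tempo` and `duration_ms` values would dominate in practice; a real pipeline would normalise each column first.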
dc.description.notification	fi=Opinnäytetyö kokotekstinä PDF-muodossa.|en=Thesis fulltext in PDF format.|sv=Lärdomsprov tillgängligt som fulltext i PDF-format|
dc.format.extent	85
dc.identifier.uri	https://osuva.uwasa.fi/handle/11111/20746
dc.identifier.urn	URN:NBN:fi-fe2026052049875
dc.language.iso	eng
dc.rights	CC BY-NC-ND 4.0
dc.subject.degreeprogramme	Master’s Programme in Computing Sciences
dc.subject.discipline	Sustainable and Autonomous Systems
dc.subject.yso	recommender systems
dc.subject.yso	users
dc.subject.yso	music
dc.subject.yso	neural networks (information technology)
dc.subject.yso	applications (computer programmes)
dc.subject.yso	data mining
dc.subject.yso	musical objects
dc.subject.yso	participatory planning
dc.subject.yso	evaluation
dc.subject.yso	live action role-playing games
dc.title	SPOTIFY APP DATA ANALYSIS AND RECOMMENDATION SYSTEM DEVELOPMENT
dc.type.ontasot	fi=Pro gradu -tutkielma|en=Master's thesis|sv=Pro gradu -avhandling|

Files

Name: Master Thesis Final by Mian 2026.pdf
Size: 1.27 MB
Format: Adobe Portable Document Format

Collections

Master's theses and diploma theses (Pro gradu -tutkielmat ja diplomityöt)