SPOTIFY APP DATA ANALYSIS AND RECOMMENDATION SYSTEM DEVELOPMENT
| dc.contributor.author | Ahmed, Mian Shayan | |
| dc.contributor.faculty | School of Technology and Innovations | |
| dc.contributor.organization | University of Vaasa | |
| dc.date.accessioned | 2026-06-08T13:44:20Z | |
| dc.date.issued | 2026-05-14 | |
| dc.description.abstract | Abstract: This thesis presents the design, implementation, and evaluation of a hybrid music recommendation system that integrates Collaborative Filtering (CF) with audio-based content features. The system was designed and implemented using a synthetic dataset generated with the same schema as the Last.fm 1K Users dataset and Spotify Web API audio features; all quantitative results reported in this thesis are based on synthetic data, and no empirical validation on real users has been conducted. Obtaining the real Last.fm dataset and validating the system on real user data is designated as the primary direction for future work. The research was motivated by the growing challenge users face in discovering relevant music from vast streaming catalogues, and by the structural limitations of single-paradigm approaches: Collaborative Filtering suffers from data sparsity and cold-start problems, while content-based methods tend to over-specialise toward established user preferences, suppressing musical discovery. The primary objectives were to: (1) analyse and preprocess large-scale music interaction data; (2) implement Alternating Least Squares (ALS) optimisation for matrix factorisation (a comparison with SGD is designated as future work); (3) engineer a 24-dimensional audio feature vector from Spotify Web API descriptors; (4) implement and compare three recommendation models: Collaborative Filtering (CF), Content-Based (CB), and a Weighted Hybrid (WH); and (5) evaluate all models using a multi-dimensional metric suite. A seven-stage Design Science Research pipeline was followed. The skip-adjusted implicit feedback signal (Equation 3.1) was constructed from these synthetic interaction logs; no real user listening histories were used. 
The 24-dimensional audio feature matrix A was engineered from Spotify Web API descriptors comprising 7 perceptual, 4 dynamic/temporal, 12 tonal (one-hot encoded key), and 1 mode dimension. The CF model was implemented using Alternating Least Squares (ALS) matrix factorisation; the CB model used cosine similarity over the audio feature space; and the Weighted Hybrid linearly interpolated the two predictions with a cross-validated mixing parameter alpha. Technologies employed included Python 3.10, NumPy, SciPy, pandas, scikit-learn, the implicit library for ALS, Spotipy for Spotify Web API access, Streamlit for the web application, and Matplotlib/Seaborn for visualisation. Evaluation across five metrics (Precision@10, Recall@10, NDCG@10, ILD@10, Novelty@10) with 95% bootstrap confidence intervals, all conducted on synthetic data with no empirical validation on real users, revealed a clear accuracy-diversity trade-off: the Content-Based model achieved the highest ranking accuracy (NDCG@10 = 0.0330, Precision@10 = 0.0367), while the Collaborative Filtering model achieved substantially higher intra-list diversity (ILD@10 = 0.9714) and novelty (Novelty@10 = 3.6411). Confidence intervals for NDCG@10 between CB and CF did not overlap, indicating a statistically significant difference on the synthetic dataset. The study concludes that no single recommendation paradigm simultaneously optimises all quality dimensions, supporting the theoretical justification for hybrid architectures. It must be emphasised that all reported quantitative results are based entirely on synthetic data; no empirical validation on real users exists in this thesis. The Weighted Hybrid with its automatic cold-start fallback (alpha = 0 for new items) provides a principled solution to the item cold-start problem. A fully reproducible nine-file Python pipeline and an interactive Streamlit web application are delivered as the principal artefacts. 
Future work includes obtaining the real Last.fm dataset and conducting empirical validation on real user data, implementing the Feature-Augmented Hybrid architecture, and conducting a longitudinal user study. KEYWORDS: music recommendation, collaborative filtering, content-based filtering, hybrid systems, matrix factorisation, audio features, Spotify, machine learning, implicit feedback, beyond-accuracy evaluation | |
| dc.description.notification | Thesis fulltext in PDF format. | |
| dc.format.extent | 85 | |
| dc.identifier.uri | https://osuva.uwasa.fi/handle/11111/20746 | |
| dc.identifier.urn | URN:NBN:fi-fe2026052049875 | |
| dc.language.iso | eng | |
| dc.rights | CC BY-NC-ND 4.0 | |
| dc.subject.degreeprogramme | Master’s Programme in Computing Sciences | |
| dc.subject.discipline | Sustainable and Autonomous Systems | |
| dc.subject.yso | recommender systems | |
| dc.subject.yso | users | |
| dc.subject.yso | music | |
| dc.subject.yso | neural networks (information technology) | |
| dc.subject.yso | applications (computer programmes) | |
| dc.subject.yso | data mining | |
| dc.subject.yso | musical objects | |
| dc.subject.yso | participatory planning | |
| dc.subject.yso | evaluation | |
| dc.subject.yso | live action role-playing games | |
| dc.title | SPOTIFY APP DATA ANALYSIS AND RECOMMENDATION SYSTEM DEVELOPMENT | |
| dc.type.ontasot | Master's thesis | |
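
The weighted hybrid described in the abstract (linear interpolation of CF and CB predictions, with alpha = 0 for new items) can be illustrated with a minimal sketch. This is not the thesis code. It assumes that alpha weights the CF score, that the user profile is the mean feature vector of listened items, and that both score vectors are already on comparable scales. The function names `cosine_scores` and `hybrid_scores` and the toy data are hypothetical.

```python
import numpy as np


def cosine_scores(user_profile, item_features):
    """Cosine similarity between one profile vector and each item row."""
    item_norms = np.linalg.norm(item_features, axis=1)
    profile_norm = np.linalg.norm(user_profile)
    denom = item_norms * profile_norm
    # Guard against zero vectors: their similarity is defined as 0 here.
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, item_features @ user_profile / safe, 0.0)


def hybrid_scores(cf_scores, cb_scores, alpha, is_new_item):
    """Blend alpha * CF + (1 - alpha) * CB, forcing alpha to 0 for new items."""
    # Assumption: a new item has no interactions, so only the CB score is used.
    per_item_alpha = np.where(is_new_item, 0.0, alpha)
    return per_item_alpha * cf_scores + (1.0 - per_item_alpha) * cb_scores


# Toy data (hypothetical): 4 items with 3-dimensional features instead of 24.
item_features = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# Assumed user profile: mean feature vector of the listened items 0 and 2.
user_profile = item_features[[0, 2]].mean(axis=0)
cb = cosine_scores(user_profile, item_features)
cf = np.array([0.2, 0.9, 0.4, 0.0])              # stand-in for ALS predictions
is_new = np.array([False, False, False, True])   # item 3 has no interactions
scores = hybrid_scores(cf, cb, alpha=0.5, is_new_item=is_new)
ranking = np.argsort(-scores)                    # best item first
```

In practice the CF and CB scores would usually be rescaled (for example, min-max normalised per user) before blending, and alpha would be chosen by cross-validation as the abstract states.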
Files
- Name: Master Thesis Final by Mian 2026.pdf
- Size: 1.27 MB
- Format: Adobe Portable Document Format
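
Two of the ranking metrics named in the abstract (Precision@10 and NDCG@10) and the 95% bootstrap confidence intervals can be sketched as follows. This is a hedged illustration, not the evaluation code from the thesis. It assumes binary relevance and a percentile bootstrap over per-user metric values. The function names, the parameters `n_resamples` and `seed`, and the toy inputs are hypothetical.

```python
import numpy as np


def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def ndcg_at_k(recommended, relevant, k=10):
    """NDCG with binary relevance: DCG of the list over DCG of an ideal list."""
    top_k = recommended[:k]
    dcg = sum(1.0 / np.log2(rank + 2)
              for rank, item in enumerate(top_k) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def bootstrap_ci(per_user_values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-user metric values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(per_user_values, dtype=float)
    # Resample users with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    lower, upper = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return lower, upper


# Toy example (hypothetical): one user's ranked list and held-out items.
recommended = [3, 1, 7, 9]
relevant = {1, 9}
p = precision_at_k(recommended, relevant, k=4)
n = ndcg_at_k(recommended, relevant, k=4)
low, high = bootstrap_ci([0.0, 0.1, 0.2, 0.0, 0.3])
```

Non-overlapping intervals between two models, as reported for NDCG@10 between CB and CF, are a conservative indication of a difference. A paired bootstrap on per-user differences would be a more direct test.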
