An exploration of implementing recommender systems using Spotify data.
As of the Spotify developer terms version 9 (8th May 2023), you may not use any content from the API to train machine learning models or AI. Such projects were previously permitted, and I am leaving this repo up solely for informational purposes.
In recommender_playlists.ipynb, traditional machine learning (ML) methods are applied to yield recommendations based on a set of favourite playlists. For this classification task, the scikit-learn implementations of logistic regression and random forest are used alongside XGBoost to classify whether a track should be considered for addition to the set of favourite playlists.
Although the traditional approach can yield good recommendations, it relies heavily on feature engineering and on there being enough features to cover user tastes (which is unlikely for subjective items such as movies and songs). As this approach relies on users defining their favourites (classification) or accurately scoring songs (regression), it suffers from both mislabelling inaccuracies and the cold-start problem when a user has a fresh account or doesn't provide any ratings. Hence recommender system approaches are applied in recommender_systems.ipynb.
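As a rough illustration of the classification setup described above, the sketch below fits the two scikit-learn models on synthetic data. The feature matrix, the labels, and the `fit_candidates` helper are all assumptions made for illustration; the notebook's real features and its XGBoost model are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_candidates(X, y, seed=0):
    """Fit both classifiers and return P(track belongs in favourites) on a held-out split.

    Hypothetical helper: X stands in for engineered track features and
    y is 1 when the track is in a favourite playlist, 0 otherwise.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    probabilities = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Column 1 holds the probability of the positive ("favourite") class.
        probabilities[name] = model.predict_proba(X_test)[:, 1]
    return probabilities, y_test


# Synthetic stand-in data, not Spotify content.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
probabilities, y_test = fit_candidates(X, y)
```

Tracks could then be ranked by predicted probability, so that the highest-scoring ones become candidates for the favourite playlists.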
Here we use the Top-N accuracy metric (in this case Top-100), which asks a recommender system to rank one item the user has interacted with against 100 items the user has not interacted with. A perfect recommender system will rank the interacted item first (highest recommendation strength). This is computed for each interacted item for each user, and then the weighted mean is taken across all users to get a global metric. The metric 'recall@5' for each user is the number of times the interacted item was ranked in the Top-5 divided by the total number of interacted items evaluated.
A popularity recommender recommends songs ranked by their popularity regardless of the user's preferences. This is of course dependent upon the methodology used to determine the popularity metric (usually some function of time, user interactions, and user ratings).
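The evaluation described above can be sketched in plain Python as follows. The function names, the `score_fn` callable, and the sampling details are assumptions for illustration and may differ from what the notebook does.

```python
import random


def user_hits(score_fn, interacted, non_interacted, n=5, sample_size=100, seed=0):
    """Count how often a held-out interacted item lands in the top n.

    score_fn: hypothetical callable mapping an item id to a recommendation strength.
    interacted: items the user interacted with (evaluated one at a time).
    non_interacted: list of items the user did not interact with.
    Returns (hits, number_of_interacted_items).
    """
    rng = random.Random(seed)
    hits = 0
    for item in interacted:
        # Rank the single interacted item against a sample of uninteracted items.
        negatives = rng.sample(non_interacted, min(sample_size, len(non_interacted)))
        ranked = sorted(negatives + [item], key=score_fn, reverse=True)
        if item in ranked[:n]:
            hits += 1
    return hits, len(interacted)


def global_recall(per_user_counts):
    """Mean recall across users, weighted by each user's number of interacted items."""
    total_hits = sum(hits for hits, _ in per_user_counts)
    total_items = sum(count for _, count in per_user_counts)
    return total_hits / total_items
```

Summing hits and item counts before dividing is equivalent to averaging the per-user recalls weighted by how many interacted items each user has.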
'recall@5': 0.09318497913769123, 'recall@10': 0.17385257301808066
As it doesn't take user activity into account, solely recommending by popularity is a poor way to recommend tracks. However, as we will see later, it is a good method to mix in for variety and to avoid the cold-start problem.
A content-based recommender leverages attributes from items the user has interacted with to recommend similar items. Here the popular TF-IDF method is used to convert unstructured text (unigrams and bigrams of genres and song/artist/album/playlist name) into a sparse matrix, in which each word is represented by a position in a vector and the value measures its relevance. These vectors are then summed and normalized across users to give a profile vector for each user. The cosine similarity between the user vector and the initial matrix (all users) then gives a metric to recommend new tracks.
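A minimal sketch of a popularity recommender is shown below. It assumes popularity is simply the number of interactions per track, which is only one possible definition, and the function names and the `(user_id, track_id)` pair format are hypothetical.

```python
from collections import Counter


def popularity_ranking(interactions):
    """Rank tracks by how many interactions they have (assumed popularity measure).

    interactions: list of (user_id, track_id) pairs.
    """
    counts = Counter(track for _, track in interactions)
    return [track for track, _ in counts.most_common()]


def recommend_popular(interactions, user_id, top_n=10):
    """Return the most popular tracks the user has not already interacted with."""
    seen = {track for user, track in interactions if user == user_id}
    return [t for t in popularity_ranking(interactions) if t not in seen][:top_n]
```

Because the ranking ignores the user entirely apart from filtering out seen tracks, a brand-new user still receives a full list, which is why this approach sidesteps the cold-start problem.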
'recall@5': 0.9123783031988874, 'recall@10': 0.972183588317107
A very high recall is observed because the dataset used treats each playlist (which tends to be heavily genre/mood based) as a user. Hence the content-based recommender performs exceedingly well by the Top-N metric but fails to give much variety.
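The TF-IDF profile idea can be sketched with scikit-learn roughly as follows. This is an illustration under assumptions: `content_scores` is a hypothetical helper, each track is represented by a single text string, and the user profile is compared against every track's TF-IDF row, which may not match the notebook exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize


def content_scores(track_texts, liked_indices):
    """Score every track against a profile built from the tracks a user liked.

    track_texts: one string per track (e.g. genres plus track/artist names).
    liked_indices: positions in track_texts of the tracks the user interacted with.
    """
    # Unigrams and bigrams, as described in the text above.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform(track_texts)  # sparse, shape (n_tracks, n_terms)

    # Sum the liked tracks' vectors and L2-normalize to get the user profile.
    profile = np.asarray(tfidf[liked_indices].sum(axis=0)).reshape(1, -1)
    profile = normalize(profile)

    # Cosine similarity between the profile and each track.
    return cosine_similarity(profile, tfidf).ravel()


texts = [
    "indie rock guitar",
    "indie rock anthem",
    "deep house club",
    "classical piano sonata",
]
scores = content_scores(texts, liked_indices=[0])
```

In this toy example the second track shares "indie", "rock", and the bigram "indie rock" with the liked track, so it scores above the two tracks with no shared terms. The same effect explains why genre-heavy playlists produce such high recall.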
A collaborative recommender can be either memory-based (based on past user interactions) or model-based (e.g. clustering). Here matrix factorisation implemented via singular value decomposition (SVD) is used to compress a user-item matrix into a low-dimensional representation. This yields better scalability and better generalisation. The items x users matrix is then used to recommend items to users based on similar user interactions.
'recall@5': 0.23783031988873435, 'recall@10': 0.30737134909596664
The collaborative approach outperforms the popularity approach but is not as good as the content-based approach. It can suffer from the sparsity problem if the user set is too small or the number of interactions is too low.
A hybrid recommender combines the content-based and collaborative approaches and has been shown to perform better in many studies. It avoids high variance and enables variety and weighting (e.g. genre weighting). As the content-based approach performs better by the Top-N metric, it is weighted more heavily here.
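A small NumPy sketch of the matrix factorisation step is given below. It assumes a dense users x items matrix of interaction strengths (the transpose of the items x users layout mentioned above) and a hypothetical number of latent factors `k`; the notebook's actual SVD routine and preprocessing may differ.

```python
import numpy as np


def svd_predicted_scores(user_item, k=2):
    """Rank-k reconstruction of a users x items matrix via truncated SVD."""
    user_item = np.asarray(user_item, dtype=float)
    U, s, Vt = np.linalg.svd(user_item, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


def recommend_collaborative(user_item, user_index, top_n=5, k=2):
    """Recommend unseen items with the highest reconstructed scores for one user."""
    user_item = np.asarray(user_item, dtype=float)
    scores = svd_predicted_scores(user_item, k)[user_index].copy()
    seen = user_item[user_index] > 0
    scores[seen] = -np.inf  # never recommend items the user already has
    order = np.argsort(scores)[::-1]
    n_unseen = int((~seen).sum())
    return order[: min(top_n, n_unseen)].tolist()
```

The low-rank reconstruction fills in scores for items a user has not interacted with, based on users with similar interaction patterns. When the matrix is very sparse there is little overlap to learn from, which is the sparsity problem noted above.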
'recall@5': 0.9068150208623088, 'recall@10': 0.9666203059805285
This weighting approach has the same issue as the content-based recommender: too much weighting on genre and not enough variety. The popularity approach is therefore also incorporated with its own weighting to give a hybrid + popularity recommender.
'recall@5': 0.6244784422809457, 'recall@10': 0.7343532684283728
Even though it has a lower recall, subjectively this recommender appears to give the best recommendations in practice. It may therefore be better to incorporate other evaluation metrics, such as one that measures variety, or to increase the scope of this dataset, as its current playlist-based approach means that pure genre-based recommendations perform the best.
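The weighted combination of recommenders described above can be sketched as a blend of per-track scores. The min-max scaling, the weight values, and the function names are assumptions chosen for illustration, not the weights used in the notebook.

```python
def min_max(scores):
    """Scale a {track: score} dict to the range [0, 1] so score sets are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {track: 0.0 for track in scores}
    return {track: (value - lo) / (hi - lo) for track, value in scores.items()}


def blend(score_sets, weights):
    """Rank tracks by a weighted sum of normalized scores from several recommenders.

    score_sets: {recommender_name: {track: score}}
    weights: {recommender_name: weight} (hypothetical values)
    """
    combined = {}
    for name, scores in score_sets.items():
        for track, value in min_max(scores).items():
            combined[track] = combined.get(track, 0.0) + weights[name] * value
    return sorted(combined, key=combined.get, reverse=True)


score_sets = {
    "content": {"a": 0.9, "b": 0.1, "c": 0.5},
    "collaborative": {"a": 0.2, "b": 0.8, "c": 0.4},
    "popularity": {"a": 10, "b": 50, "c": 100},
}
content_only = blend(score_sets, {"content": 1.0, "collaborative": 0.0, "popularity": 0.0})
with_popularity = blend(score_sets, {"content": 0.5, "collaborative": 0.2, "popularity": 0.3})
```

With these toy numbers, giving popularity a non-zero weight moves track "c" ahead of track "a". This is the kind of reshuffling that trades some Top-N recall for more variety.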
- Expand the dataset to include other users (rather than proxying playlists as users)
- Develop other evaluation metrics (although this is somewhat solved by expanding the dataset)
- Access the Spotify API
- Set up your project, including the project settings (I used https://localhost:9001/callback as the redirect URI)
- Create the spotify/spotify_details.yml YAML file with the Spotify API client_id, client_secret, and redirect_uri
- Run music_data.py to pull Spotify data to local pandas dataframes
- Explore the traditional ML implementations in recommender_playlists.ipynb and recommender system implementations in recommender_systems.ipynb
- Enjoy the resulting playlists!
Modules:
- music_data.py: Script file for pulling Spotify music data
- data_functions.py: Helper functions for pulling Spotify music data
- recommender_playlists.ipynb: Jupyter notebook to recommend tracks based on traditional ML techniques
- recommender_systems.ipynb: Jupyter notebook that implements and evaluates popularity, content-based, collaborative, and hybrid recommender system approaches. Made with reference to recommender-systems-in-python-101 on Kaggle
The gitignore'd 'spotify' folder contains locally saved pandas dataframes from music_data.py as well as the Spotify API details (spotify/spotify_details.yml) and playlist ids (spotify/playlists.yml)
- spotipy has a repo of examples available on GitHub
- Spotify for Developers has a dashboard to manage your project as well as development guides and an API reference guide