added basic tests
Eric Lundquist authored and Eric Lundquist committed May 26, 2020
1 parent 715f23f commit ef5dfd0
Showing 4 changed files with 227 additions and 35 deletions.
9 changes: 2 additions & 7 deletions .gitignore
@@ -1,6 +1,6 @@
# excluded folders
# exclude data and private notebooks
data/
notebooks/old/
examples/ignore/

# system files
*.DS_Store
@@ -17,18 +17,13 @@ lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# sphinx documentation
docs/_build/

# spark stuff
*/derby.log
*/metastore_db/
21 changes: 10 additions & 11 deletions README.md
@@ -1,14 +1,12 @@
# RankFM

RankFM is a python implementation of the general Factorization Machines model class described in [Rendle 2010](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf) adapted for collaborative filtering recommendation/ranking problems with implicit feedback user-item interaction data. It uses the Bayesian Personalized Ranking (BPR-OPT) optimization criteria described in [Rendle 2009](https://arxiv.org/pdf/1205.2618.pdf) to learn model weights via Stochastic Gradient Descent (SGD). It can also incorporate user and/or item auxiliary features to augment the main interaction data which may increase model performance, especially in contexts where interaction data is highly sparse but rich user/item metadata features exist.
RankFM is a python implementation of the general Factorization Machines model class described in [Rendle 2010](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf) adapted for collaborative filtering recommendation/ranking problems with implicit feedback user-item interaction data. It uses the Bayesian Personalized Ranking (BPR-OPT) optimization criteria described in [Rendle 2009](https://arxiv.org/pdf/1205.2618.pdf) to learn model weights via Stochastic Gradient Descent (SGD). It can also incorporate user and/or item auxiliary features to augment the main interaction data, which may increase model performance, especially in contexts where the interaction data is highly sparse but rich user and/or item metadata features exist.

RankFM's core training/prediction/recommendation subroutines are converted to optimized machine code at runtime using the excellent [Numba](https://numba.pydata.org/) LLVM JIT compiler which can compile Python numerical algorithms to run at speeds approaching C/Fortran. This makes it possible to scale model training and recommendation to millions of user/item interactions.
The core training/prediction/recommendation subroutines are converted to optimized machine code at runtime using the [Numba](https://numba.pydata.org/) LLVM JIT compiler. This makes it possible to scale model training and recommendation to millions of user/item interactions. Designed for ease-of-use, RankFM accepts both `pd.DataFrame` and `np.ndarray` inputs. You do not have to convert your data to `scipy.sparse` matrices or re-map user/item identifiers to array indexes prior to use - RankFM internally maps all user/item identifiers to zero-based integer indexes, but always converts its outputs back to the original user/item identifiers from your data, which can be arbitrary (non-zero-based, non-consecutive) integers or even strings.

Designed for ease-of-use, RankFM accepts both `pd.DataFrame` and `np.ndarray` inputs. You do not have to convert your data to `scipy.sparse` matrices or re-map user/item identifiers to array indexes prior to use - internally RankFM maps all user/item identifiers to zero-based integer indexes, but always converts its output back to the original user/item identifiers from your data, which can be arbitrary (non-zero-based, non-consecutive) integers or even strings.
In addition to the familiar `fit()`, `predict()`, `recommend()` methods, RankFM includes additional utilities `similar_users()` and `similar_items()` to find the most similar users/items to a given user/item based on latent factor space embeddings. A number of popular recommendation/ranking evaluation metric functions have been included in the separate `evaluation` module to streamline model tuning and validation. See the **Quickstart** section below to get started, and the `quickstart.ipynb` notebook in the `/examples` folder for a more in-depth walkthrough.
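
The identifier handling described in this README (raw identifiers mapped to zero-based indexes internally, then mapped back on output) can be sketched roughly as follows. This is an illustration only, not the library's actual implementation: the sample data is made up, and although the names `user_to_index`, `item_to_index`, and `index_to_item` mirror attributes referenced in `rankfm.py`, the way they are built here is an assumption.

```python
import numpy as np
import pandas as pd

# hypothetical interaction data: string user ids, non-consecutive integer item ids
interactions = pd.DataFrame({
    "user_id": ["u42", "u7", "u42", "u99"],
    "item_id": [610, 377, 12, 610],
})

# the sorted unique identifiers define the zero-based index positions (assumed convention)
user_ids = np.sort(interactions["user_id"].unique())
item_ids = np.sort(interactions["item_id"].unique())

# lookup tables: raw identifier -> index, and index -> raw identifier
user_to_index = pd.Series(np.arange(len(user_ids)), index=user_ids)
item_to_index = pd.Series(np.arange(len(item_ids)), index=item_ids)
index_to_item = pd.Series(item_ids, index=np.arange(len(item_ids)))

# encode raw item ids for internal use, then decode back to the caller's identifiers
encoded_items = interactions["item_id"].map(item_to_index).to_numpy()
decoded_items = pd.Series(encoded_items).map(index_to_item).to_numpy()
```

Keeping both directions of the mapping as `pd.Series` lookups means encoding and decoding are each a single `.map()` call, which is the pattern `predict()` and `similar_items()` use in the diff below.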

In addition to the familiar `fit()`, `predict()`, `recommend()` methods, RankFM includes additional utilities to find the most similar users/items to a given user/item based on user/item latent factor space embeddings. A number of popular recommendation/ranking evaluation metric functions are also included in the `evaluation` module to streamline model performance tuning and evaluation.

See the **Quickstart** section below to get started, and the `quickstart.ipynb` notebook in the `/examples` folder for a more in-depth walkthrough. This package is currently under active development pre-release, and should not yet be considered stable. Release, build status, and PyPI information will be added once things get to a stable and satisfactory state for an initial release. The core functionality is mostly in place and working, but automated tests and CI workflows need to be added, and I need to teach myself how to do all that stuff first :)
This package is currently under active development pre-release, and should not yet be considered stable. Release, build status, and PyPI information will be added once things get to a stable and satisfactory state for an initial release. The core functionality is mostly in place and working, but automated tests and CI workflows need to be added, and I need to teach myself how to do all that stuff first :)

---
### Dependencies
@@ -34,7 +32,7 @@ Let's first look at the required shape of the interaction data:
| 5 | 377 |
| 8 | 610 |

It has just two columns: a `user_id` and an `item_id` (although you can name these fields whatever you want or use a numpy array instead). Notice that there is no `rating` column - this library is for **implicit feedback** data (e.g. watches, page views, purchases, clicks) as opposed to **explicit feedback** data (e.g. 1-5 ratings, thumbs up/down). Implicit feedback is far more common in real-world recommendation contexts and doesn't suffer from the missing-not-at-random problem of pure explicit feedback approaches. Maciej Kula (legendary open-source recsys developer) provides an [excellent overview of the differences](https://resources.bibblio.org/hubfs/share/2018-01-24-RecSysLDN-Ravelin.pdf).
It has just two columns: a `user_id` and an `item_id` (you can name these fields whatever you want or use a numpy array instead). Notice that there is no `rating` column - this library is for **implicit feedback** data (e.g. watches, page views, purchases, clicks) as opposed to **explicit feedback** data (e.g. 1-5 ratings, thumbs up/down). Implicit feedback is far more common in real-world recommendation contexts and doesn't suffer from the missing-not-at-random problem of pure explicit feedback approaches. Maciej Kula (legendary open-source recsys developer) provides an [excellent overview of the differences](https://resources.bibblio.org/hubfs/share/2018-01-24-RecSysLDN-Ravelin.pdf).

Now let's import the library, initialize our model, and fit on the training data:
```python
@@ -44,13 +42,13 @@ model = RankFM(factors=10, regularization=0.01, learning_rate=0.1, learning_sche
model.fit(interactions_train, epochs=20, verbose=True)
# NOTE: this takes about 90 seconds for 750,000 interactions on my 2.3 GHz i5 8GB RAM MacBook
```
If you set `verbose=True` the model will print the current epoch number as well as the epoch's log-likelihood during training. This can be useful to gauge both computational speed and training performance by epoch. If the log likelihood is not increasing then try upping the `learning_rate` or lowering the `regularization`. If the log likelihood is starting to sometimes decrease in later training epochs try lowering the `learning_rate` or using `learning_schedule='invscaling'` to gradually decrease the learning rate over time.
If you set `verbose=True` the model will print the current epoch number as well as the epoch's log-likelihood during training. This can be useful to gauge both computational speed and training performance by epoch. If the log likelihood is not increasing then try upping the `learning_rate` or lowering the `regularization`. If the log likelihood is starting to bounce up and down try lowering the `learning_rate` or using `learning_schedule='invscaling'` to decrease the learning rate over time.
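
To make the `learning_schedule='invscaling'` advice concrete, here is a minimal sketch of an inverse-scaling schedule. The formula `eta_t = eta_0 / (epoch + 1) ** power` is the common inverse-scaling form and is an assumption here: the diff does not show how RankFM computes it, and the `power` parameter and `invscaling_rate` helper are hypothetical names used only for illustration.

```python
def invscaling_rate(learning_rate, epoch, power=0.25):
    """Hypothetical inverse-scaling schedule: eta_t = eta_0 / (epoch + 1) ** power."""
    return learning_rate / (epoch + 1) ** power


# learning rate per epoch for 20 epochs, starting from learning_rate=0.1
rates = [invscaling_rate(0.1, epoch) for epoch in range(20)]
```

The rate starts at the configured `learning_rate` and shrinks every epoch, which damps the late-epoch oscillation in log-likelihood that a constant rate can produce.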

Now let's generate some user-item model scores from the validation data:
```python
valid_scores = model.predict(interactions_valid, cold_start='nan')
```
This will produce an array of real-valued model scores generated using the Factorization Machine model equation. You can interpret it as a measure of the predicted utility of a user (u) getting recommended an item (i). The `cold_start='nan'` option can be used to set scores to `np.nan` for user/item pairs not found in the training data, or `cold_start='drop'` can be specified to drop those pairs so the results contain no missing values.
This will produce an array of real-valued model scores generated using the Factorization Machines model equation. You can interpret it as a measure of the predicted utility of item (i) for user (u). The `cold_start='nan'` option can be used to set scores to `np.nan` for user/item pairs not found in the training data, or `cold_start='drop'` can be specified to drop those pairs so the results contain no missing values.
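
The difference between the two `cold_start` options can be sketched with plain pandas/numpy. This is not the library's code: the lookup tables and the constant placeholder score are made up for illustration, and only the `.map()`-based identifier lookup mirrors what `predict()` does in `rankfm.py`.

```python
import numpy as np
import pandas as pd

# hypothetical lookup tables learned from the training data
user_to_index = pd.Series([0, 1], index=["a", "b"])
item_to_index = pd.Series([0, 1, 2], index=[10, 20, 30])

# pairs to score: item 99 and user "z" were never seen in training
pairs = pd.DataFrame({"user_id": ["a", "b", "z"], "item_id": [10, 99, 30]})

# unseen identifiers map to NaN
u_idx = pairs["user_id"].map(user_to_index)
i_idx = pairs["item_id"].map(item_to_index)

# placeholder score of 1.0 for known pairs (a real model would compute this)
known = (u_idx.notna() & i_idx.notna()).to_numpy()
scores = np.where(known, 1.0, np.nan)

scores_nan = scores                       # cold_start='nan': same length as the input
scores_drop = scores[~np.isnan(scores)]   # cold_start='drop': unseen pairs removed
```

With `'nan'` the output stays aligned row-for-row with the input pairs, while `'drop'` returns a shorter array, so the caller has to track which pairs survived.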

Now let's generate our topN recommended movies for each user:
```python
@@ -81,6 +79,7 @@ dcg: 0.704
precision: 0.152
recall: 0.068
```
[That's a Bingo!](https://www.youtube.com/watch?v=q5pESPQpXxE)

Now let's find the most similar other movies for a few movies based on their embedding representations in latent factor space:
```python
@@ -99,7 +98,7 @@ model.similar_items(589, n_items=10)
480 Jurassic Park (1993)
1200 Aliens (1986)
```
I hope you like explosions...
[I hope you like explosions...](https://www.youtube.com/watch?v=uENYMZNzg9w)

```python
# Being John Malkovich (1999)
@@ -117,7 +116,7 @@ model.similar_items(2997, n_items=10)
2908 Boys Don't Cry (1999)
3481 High Fidelity (2000)
```
Let's get weird...
[Let's get weird...](https://www.youtube.com/watch?v=lIpev8JXJHQ&t=5s)

---
That's all for now. To see more in-depth worked examples in jupyter notebook format head to the `/examples` folder. Be sure to check back for added functionality and PyPI release status in the near future as soon as I teach myself how to use CI workflows and go where few data scientists have gone before: a comprehensive set of unit tests. Stay tuned...
29 changes: 13 additions & 16 deletions rankfm/rankfm.py
@@ -247,14 +247,12 @@ def fit_partial(self, interactions, user_features=None, item_features=None, epoc
:return: self
"""

# initialize necessary internal data structures
if self.is_fit:
self._init_interactions(interactions)
self._init_features(user_features, item_features)
else:
self._init_all(interactions, user_features, item_features)

# call numba internals
updated_weights = _fit(
self.interactions,
self.user_items_nb,
@@ -285,18 +283,16 @@ def predict(self, pairs, cold_start='nan'):
:param pairs: dataframe of [user, item] pairs to score
:param cold_start: whether to generate missing values ('nan') or drop ('drop') user/item pairs not found in training data
:return: vector of real-valued model scores
:return: np.array of real-valued model scores
"""

# ensure that the model has been fit before attempting to generate predictions
assert pairs.shape[1] == 2, "[pairs] should be: [user_id, item_id]"
assert self.is_fit, "you must fit the model prior to generating predictions"

# map raw user/item identifiers to internal index positions
pred_pairs = pd.DataFrame(pairs.copy(), columns=['user_id', 'item_id'])
pred_pairs['user_id'] = pred_pairs['user_id'].map(self.user_to_index)
pred_pairs['item_id'] = pred_pairs['item_id'].map(self.item_to_index)

# call numba internals
pred_pairs = pred_pairs.to_numpy().astype(np.float32)
scores = _predict(
pred_pairs,
@@ -321,17 +317,16 @@ def predict(self, pairs, cold_start='nan'):
def recommend(self, users, n_items=10, filter_previous=False, cold_start='nan'):
"""calculate the topN items for each user
:param users: list-like of user identifiers for which to generate recommendations
:param users: iterable of user identifiers for which to generate recommendations
:param n_items: number of recommended items to generate for each user
:param filter_previous: remove observed training items from generated recommendations
:param cold_start: whether to generate missing values ('nan') or drop ('drop') users not found in training data
:return: pandas dataframe where the index values are user identifiers and the columns are recommended items
"""

# ensure that the model has been fit before attempting to generate predictions
assert getattr(users, '__iter__', False), "[users] must be an iterable (e.g. list, array, series)"
assert self.is_fit, "you must fit the model prior to generating recommendations"

# call numba internals
user_idx = pd.Series(users).map(self.user_to_index).to_numpy(dtype=np.float32)
rec_items = _recommend(
user_idx,
@@ -362,10 +357,11 @@ def similar_items(self, item_id, n_items=10):
:param item_id: item to search
:param n_items: number of similar items to return
:return: topN most similar items wrt latent factor representations
:return: np.array of topN most similar items wrt latent factor representations
"""

# ensure that the model has been fit before attempting to generate predictions
assert item_id in self.item_id, "you must select an [item_id] present in the training data"
assert self.is_fit, "you must fit the model prior to generating similarities"

try:
@@ -379,7 +375,7 @@ def similar_items(self, item_id, n_items=10):

# calculate the most similar N items excluding the search item
similarities = pd.Series(np.dot(lr_all_items, lr_item)).drop(item_idx).sort_values(ascending=False)[:n_items]
most_similar = pd.Series(similarities.index).map(self.index_to_item)
most_similar = pd.Series(similarities.index).map(self.index_to_item).values
return most_similar


@@ -388,23 +384,24 @@ def similar_users(self, user_id, n_users=10):
:param user_id: user to search
:param n_users: number of similar users to return
:return: topN most similar users wrt latent factor representations
:return: np.array of topN most similar users wrt latent factor representations
"""

# ensure that the model has been fit before attempting to generate predictions
assert user_id in self.user_id, "you must select an [user_id] present in the training data"
assert self.is_fit, "you must fit the model prior to generating similarities"

try:
user_idx = self.user_to_index.loc[user_id]
except (KeyError, TypeError):
print("user_id={} not found in training data".format(user_id))

# calculate item latent representations in F dimensional factor space
lr_user = self.v_i[user_idx] + np.dot(self.v_uf.T, self.x_uf[user_idx])
lr_all_users = self.v_i + np.dot(self.x_uf, self.v_uf)
# calculate user latent representations in F dimensional factor space
lr_user = self.v_u[user_idx] + np.dot(self.v_uf.T, self.x_uf[user_idx])
lr_all_users = self.v_u + np.dot(self.x_uf, self.v_uf)

# calculate the most similar N users excluding the search user
similarities = pd.Series(np.dot(lr_all_users, lr_user)).drop(user_idx).sort_values(ascending=False)[:n_users]
most_similar = pd.Series(similarities.index).map(self.index_to_user)
most_similar = pd.Series(similarities.index).map(self.index_to_user).values
return most_similar
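
The similarity lookup changed in this commit (dot products in latent factor space, drop the query row, take the top N, map indexes back to raw identifiers) can be sketched in isolation as below. It is a simplified, standalone illustration: the factor matrix and item identifiers are made up, the auxiliary feature term (`np.dot(self.x_if, self.v_if)` style) is omitted, and the free function `similar_items` here is a stand-in for the method, not the method itself.

```python
import numpy as np
import pandas as pd

# hypothetical item factor matrix: 4 items embedded in a 2-dimensional latent space
v_i = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.5, 0.5],
])

# hypothetical mapping from internal index position back to the raw item identifier
index_to_item = pd.Series([589, 480, 2997, 1200])


def similar_items(item_idx, n_items=2):
    """Return the raw ids of the n_items most similar items by latent dot product."""
    sims = pd.Series(v_i @ v_i[item_idx])      # similarity of every item to the query
    sims = sims.drop(item_idx)                 # exclude the query item itself
    top = sims.sort_values(ascending=False).head(n_items)
    return pd.Series(top.index).map(index_to_item).to_numpy()


most_similar = similar_items(0)
```

Returning a plain `np.array` of raw identifiers, as the commit does by appending `.values`, saves callers from handling a `pd.Series` whose index has no meaning outside the model.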
