added basic tests

etlundquist · May 26, 2020 · ef5dfd0 · ef5dfd0
1 parent 715f23f
commit ef5dfd0
Show file tree

Hide file tree

Showing 4 changed files with 227 additions and 35 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,6 @@
-# excluded folders
+# exclude data and private notebooks
 data/
-notebooks/old/
+examples/ignore/
 
 # system files
 *.DS_Store
@@ -17,18 +17,13 @@ lib/
 lib64/
 parts/
 sdist/
-var/
-wheels/
 pip-wheel-metadata/
 share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 
-# sphinx documentation
-docs/_build/
-
 # spark stuff
 */derby.log
 */metastore_db/
diff --git a/README.md b/README.md
@@ -1,14 +1,12 @@
 # RankFM
 
-RankFM is a python implementation of the general Factorization Machines model class described in [Rendle 2010](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf) adapted for collaborative filtering recommendation/ranking problems with implicit feedback user-item interaction data. It uses the Bayesian Personalized Ranking (BPR-OPT) optimization criteria described in [Rendle 2009](https://arxiv.org/pdf/1205.2618.pdf) to learn model weights via Stochastic Gradient Descent (SGD). It can also incorporate user and/or item auxiliary features to augment the main interaction data which may increase model performance, especially in contexts where interaction data is highly sparse but rich user/item metadata features exist.
+RankFM is a python implementation of the general Factorization Machines model class described in [Rendle 2010](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf) adapted for collaborative filtering recommendation/ranking problems with implicit feedback user-item interaction data. It uses the Bayesian Personalized Ranking (BPR-OPT) optimization criteria described in [Rendle 2009](https://arxiv.org/pdf/1205.2618.pdf) to learn model weights via Stochastic Gradient Descent (SGD). It can also incorporate user and/or item auxiliary features to augment the main interaction data, which may increase model performance, especially in contexts where the interaction data is highly sparse but rich user and/or item metadata features exist.
 
-RankFM's core training/prediction/recommendation subroutines are converted to optimized machine code at runtime using the excellent [Numba](http:https://numba.pydata.org/) LLVM JIT compiler which can compile Python numerical algorithms to run at speeds approaching C/Fortran. This makes it possible to scale model training and recommendation to millions of user/item interactions.
+The core training/prediction/recommendation subroutines are converted to optimized machine code at runtime using the [Numba](http:https://numba.pydata.org/) LLVM JIT compiler. This makes it possible to scale model training and recommendation to millions of user/item interactions. Designed for ease-of-use, RankFM accepts both `pd.DataFrame` and `np.ndarray` inputs. You do not have to convert your data to `scipy.sparse` matrices or re-map user/item identifiers to array indexes prior to use - RankFM internally maps all user/item identifiers to zero-based integer indexes, but always converts its outputs back to the original user/item identifiers from your data, which can be arbitrary (non-zero-based, non-consecutive) integers or even strings.
 
-Designed for ease-of-use, RankFM accepts both `pd.DataFrame` and `np.ndarray` inputs. You do not have to convert your data to `scipy.sparse` matrices or re-map user/item identifiers to array indexes prior to use - internally RankFM maps all user/item identifiers to zero-based integer indexes, but always converts its output back to the original user/item identifiers from your data, which can be arbitrary (non-zero-based, non-consecutive) integers or even strings.
+In addition to the familiar `fit()`, `predict()`, `recommend()` methods, RankFM includes additional utilities `similiar_users()` and `similar_items()` to find the most similar users/items to a given user/item based on latent factor space embeddings. A number of popular recommendation/ranking evaluation metric functions have been included in the separate `evaluation` module to streamline model tuning and validation. See the **Quickstart** section below to get started, and the `quickstart.ipynb` notebook in the `/examples` folder for a more in-depth walkthrough.
 
-In addition to the familiar `fit()`, `predict()`, `recommend()` methods, RankFM includes additional utilities to find the most similar users/items to a given user/item based on user/item latent factor space embeddings. A number of popular recommendation/ranking evaluation metric functions are also included in the `evaluation` module to streamline model performance tuning and evaluation.
-
-See the **Quickstart** section below to get started, and the `quickstart.ipynb` notebook in the `/examples` folder for a more in-depth walkthrough. This package is currently under active development pre-release, and should not yet be considered stable. Release, build status, and PyPI information will be added once things get to a stable and satisfactory state for an initial release. The core functionality is mostly in place and working, but automated tests and CI workflows need to be added, and I need to teach myself how to do all that stuff first :)
+This package is currently under active development pre-release, and should not yet be considered stable. Release, build status, and PyPI information will be added once things get to a stable and satisfactory state for an initial release. The core functionality is mostly in place and working, but automated tests and CI workflows need to be added, and I need to teach myself how to do all that stuff first :)
 
 ---
 ### Dependencies
@@ -34,7 +32,7 @@ Let's first look at the required shape of the interaction data:
 | 5 | 377 |
 | 8 | 610 |
 
-It has just two columns: a `user_id` and an `item_id` (although you can name these fields whatever you want or use a numpy array instead). Notice that there is no `rating` column - this library is for **implicit feedback** data (e.g. watches, page views, purchases, clicks) as opposed to **explicit feedback** data (e.g. 1-5 ratings, thumbs up/down). Implicit feedback is far more common in real-world recommendation contexts and doesn't suffer from the missing-not-at-random problem of pure explicit feedback approaches. Maciej Kula (legendary open-source recsys developer) provides an [excellent overview of the differences](https://resources.bibblio.org/hubfs/share/2018-01-24-RecSysLDN-Ravelin.pdf).
+It has just two columns: a `user_id` and an `item_id` (you can name these fields whatever you want or use a numpy array instead). Notice that there is no `rating` column - this library is for **implicit feedback** data (e.g. watches, page views, purchases, clicks) as opposed to **explicit feedback** data (e.g. 1-5 ratings, thumbs up/down). Implicit feedback is far more common in real-world recommendation contexts and doesn't suffer from the missing-not-at-random problem of pure explicit feedback approaches. Maciej Kula (legendary open-source recsys developer) provides an [excellent overview of the differences](https://resources.bibblio.org/hubfs/share/2018-01-24-RecSysLDN-Ravelin.pdf).
 
 Now let's import the library, initialize our model, and fit on the training data:
 ```python
@@ -44,13 +42,13 @@ model = RankFM(factors=10, regularization=0.01, learning_rate=0.1, learning_sche
 model.fit(interactions_train, epochs=20, verbose=True)
 # NOTE: this takes about 90 seconds for 750,000 interactions on my 2.3 GHz i5 8GB RAM MacBook
 ```
-If you set `verbose=True` the model will print the current epoch number as well as the epoch's log-likelihood during training. This can be useful to gauge both computational speed and training performance by epoch. If the log likelihood is not increasing then try upping the `learning_rate` or lowering the `regularization`. If the log likelihood is starting to sometimes decrease in later training epochs try lowering the `learning_rate` or using `learning_schedule='invscaling'` to gradually decrease the learning rate over time.
+If you set `verbose=True` the model will print the current epoch number as well as the epoch's log-likelihood during training. This can be useful to gauge both computational speed and training performance by epoch. If the log likelihood is not increasing then try upping the `learning_rate` or lowering the `regularization`. If the log likelihood is starting to bounce up and down try lowering the `learning_rate` or using `learning_schedule='invscaling'` to decrease the learning rate over time.
 
 Now let's generate some user-item model scores from the validation data:
 ```python
 valid_scores = model.predict(interactions_valid, cold_start='nan')
 ```
-this will produce an array of real-valued model scores generated using the Factorization Machine model equation. You can interpret it as a measure of the predicted utility of a user (u) getting recommended an item (i). The `cold_start='nan'` option can be used to set scores to `np.nan` for user/item pairs not found in the training data, or `cold_start='drop'` can be specified to drop those pairs so the results contain no missing values.
+this will produce an array of real-valued model scores generated using the Factorization Machines model equation. You can interpret it as a measure of the predicted utility of item (i) for user (u). The `cold_start='nan'` option can be used to set scores to `np.nan` for user/item pairs not found in the training data, or `cold_start='drop'` can be specified to drop those pairs so the results contain no missing values.
 
 Now let's generate our topN recommended movies for each user:
 ```python
@@ -81,6 +79,7 @@ dcg: 0.704
 precision: 0.152
 recall: 0.068
 ```
+[That's a Bingo!](https://www.youtube.com/watch?v=q5pESPQpXxE)
 
 Now let's find the most similar other movies for a few movies based on their embedding representations in latent factor space:
 ```python
@@ -99,7 +98,7 @@ model.similar_items(589, n_items=10)
 480 Jurassic Park (1993)
 1200 Aliens (1986)
 ```
-I hope you like explosions...
+[I hope you like explosions...](https://www.youtube.com/watch?v=uENYMZNzg9w)
 
 ```python
 # Being John Malkovich (1999)
@@ -117,7 +116,7 @@ model.similar_items(2997, n_items=10)
 2908 Boys Don't Cry (1999)
 3481 High Fidelity (2000)
 ```
-Let's get weird...
+[Let's get weird...](https://www.youtube.com/watch?v=lIpev8JXJHQ&t=5s)
 
 ---
 That's all for now. To see more in-depth worked examples in jupyter notebook format head to the `/examples` folder. Be sure to check back for added functionality and PyPI release status in the near future as soon as I teach myself how to use CI workflows and go where few data scientists have gone before: a comprehensive set of unit tests. Stay tuned...

diff --git a/rankfm/rankfm.py b/rankfm/rankfm.py
@@ -247,14 +247,12 @@ def fit_partial(self, interactions, user_features=None, item_features=None, epoc
  :return: self
  """
 
- # initialize necessary internal data structures
  if self.is_fit:
  self._init_interactions(interactions)
  self._init_features(user_features, item_features)
  else:
  self._init_all(interactions, user_features, item_features)
 
- # call numba internals
  updated_weights = _fit(
  self.interactions,
  self.user_items_nb,
@@ -285,18 +283,16 @@ def predict(self, pairs, cold_start='nan'):
 
  :param pairs: dataframe of [user, item] pairs to score
  :param cold_start: whether to generate missing values ('nan') or drop ('drop') user/item pairs not found in training data
- :return: vector of real-valued model scores
+ :return: np.array of real-valued model scores
  """
 
- # ensure that the model has been fit before attempting to generate predictions
+ assert pairs.shape[1] == 2, "[pairs] should be: [user_id, item_id]"
  assert self.is_fit, "you must fit the model prior to generating predictions"
 
- # map raw user/item identifiers to internal index positions
  pred_pairs = pd.DataFrame(pairs.copy(), columns=['user_id', 'item_id'])
  pred_pairs['user_id'] = pred_pairs['user_id'].map(self.user_to_index)
  pred_pairs['item_id'] = pred_pairs['item_id'].map(self.item_to_index)
 
- # call numba internals
  pred_pairs = pred_pairs.to_numpy().astype(np.float32)
  scores = _predict(
  pred_pairs,
@@ -321,17 +317,16 @@ def predict(self, pairs, cold_start='nan'):
  def recommend(self, users, n_items=10, filter_previous=False, cold_start='nan'):
  """calculate the topN items for each user
 
- :param users: list-like of user identifiers for which to generate recommendations
+ :param users: iterable of user identifiers for which to generate recommendations
  :param n_items: number of recommended items to generate for each user
  :param filter_previous: remove observed training items from generated recommendations
  :param cold_start: whether to generate missing values ('nan') or drop ('drop') users not found in training data
  :return: pandas dataframe where the index values are user identifiers and the columns are recommended items
  """
 
- # ensure that the model has been fit before attempting to generate predictions
+ assert getattr(users, '__iter__', False), "[users] must be an iterable (e.g. list, array, series)"
  assert self.is_fit, "you must fit the model prior to generating recommendations"
 
- # call numba internals
  user_idx = pd.Series(users).map(self.user_to_index).to_numpy(dtype=np.float32)
  rec_items = _recommend(
  user_idx,
@@ -362,10 +357,11 @@ def similar_items(self, item_id, n_items=10):
 
  :param item_id: item to search
  :param n_items: number of similar items to return
- :return: topN most similar items wrt latent factor representations
+ :return: np.array of topN most similar items wrt latent factor representations
  """
 
  # ensure that the model has been fit before attempting to generate predictions
+ assert item_id in self.item_id, "you must select an [item_id] present in the training data"
  assert self.is_fit, "you must fit the model prior to generating similarities"
 
  try:
@@ -379,7 +375,7 @@ def similar_items(self, item_id, n_items=10):
 
  # calculate the most similar N items excluding the search item
  similarities = pd.Series(np.dot(lr_all_items, lr_item)).drop(item_idx).sort_values(ascending=False)[:n_items]
- most_similar = pd.Series(similarities.index).map(self.index_to_item)
+ most_similar = pd.Series(similarities.index).map(self.index_to_item).values
  return most_similar
 
 
@@ -388,23 +384,24 @@ def similar_users(self, user_id, n_users=10):
 
  :param user_id: user to search
  :param n_users: number of similar users to return
- :return: topN most similar users wrt latent factor representations
+ :return: np.array of topN most similar users wrt latent factor representations
  """
 
  # ensure that the model has been fit before attempting to generate predictions
+ assert user_id in self.user_id, "you must select an [user_id] present in the training data"
  assert self.is_fit, "you must fit the model prior to generating similarities"
 
  try:
  user_idx = self.user_to_index.loc[user_id]
  except (KeyError, TypeError):
  print("user_id={} not found in training data".format(user_id))
 
- # calculate item latent representations in F dimensional factor space
- lr_user = self.v_i[user_idx] + np.dot(self.v_uf.T, self.x_uf[user_idx])
- lr_all_users = self.v_i + np.dot(self.x_uf, self.v_uf)
+ # calculate user latent representations in F dimensional factor space
+ lr_user = self.v_u[user_idx] + np.dot(self.v_uf.T, self.x_uf[user_idx])
+ lr_all_users = self.v_u + np.dot(self.x_uf, self.v_uf)
 
  # calculate the most similar N users excluding the search user
  similarities = pd.Series(np.dot(lr_all_users, lr_user)).drop(user_idx).sort_values(ascending=False)[:n_users]
- most_similar = pd.Series(similarities.index).map(self.index_to_user)
+ most_similar = pd.Series(similarities.index).map(self.index_to_user).values
  return most_similar