# Instacart Market Basket Analysis

This repository contains my solution for the Instacart Market Basket Analysis competition hosted on Kaggle, which earned me 39th place out of 2669 teams. For anyone interested, please check [this page](https://www.kaggle.com/c/instacart-market-basket-analysis) for details about the competition.

## Dataset

The solution is built on [the Instacart dataset](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) and can easily be adapted to other e-commerce datasets by modifying the input.

## Solution

The problem is formulated as approximating P(u, p | user's prior purchase history), i.e., how likely user u is to repurchase product p given their prior purchase history.

The main model is a binary classifier; manually and automatically created features are fed to the classifier to generate predictions.
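
As a hedged illustration of this framing (the column names and toy data are assumptions, not the repo's actual schema), candidate (user, product) pairs can be labeled like this:

```python
import pandas as pd

# Prior orders as (user_id, product_id, order_number) rows -- toy data.
prior = pd.DataFrame({
    "user_id":      [1, 1, 1, 2, 2],
    "product_id":   [10, 20, 10, 30, 10],
    "order_number": [1, 1, 2, 1, 1],
})
# Products bought in each user's most recent ("train") order.
train_order = pd.DataFrame({
    "user_id":    [1, 2],
    "product_id": [10, 30],
})

# Candidate set: every (user, product) pair seen in the prior history.
pairs = prior[["user_id", "product_id"]].drop_duplicates()

# Label: 1 if the pair reappears in the train order, else 0.
train_order["label"] = 1
data = pairs.merge(train_order, on=["user_id", "product_id"], how="left")
data["label"] = data["label"].fillna(0).astype(int)
```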

### Features
I constructed both manual features and automatic features, the latter using unsupervised learning and neural networks.

Manual features include statistics of the prior purchase history. As for automatic features, I used the following (a minimal LDA sketch follows the list):

> **LDA**
> * Each user is treated as a document and each product as a word.
> * Generates topic representations of each user and each product.
> * The score for a <u, p> pair is the inner product of the two topic vectors.
> * Similar operations are applied at the aisle and department level.
> * Both the <u, p> scores and the compressed topic representations of users/items serve as good features.
>
> **WORD2VEC**
> * Similar to LDA, but only at the product level.
>
> **LSTM**
> * The intervals between user u's sequential purchases of product p are modeled as a time series.
> * An LSTM regression model predicts the next value of this series.
> * The predicted next interval serves as a good feature.
>
> **DREAM**
> * An RNN- and Bayesian-Personalized-Ranking-based model; refer to this repo for my implementation.
> * DREAM provides <u, p> scores, dynamic user representations, and item embeddings.
> * It captures sequential information such as periodicity in users' prior purchase history.
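
As a rough illustration of the LDA features above, here is a minimal sketch using gensim; the toy data and variable names are assumptions, not the repo's actual code:

```python
import numpy as np
from gensim import corpora, models

# Toy purchase histories: each user's prior products form one "document".
user_products = {
    "u1": ["p1", "p2", "p3", "p1"],
    "u2": ["p2", "p4", "p4"],
}

texts = list(user_products.values())
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Topic representation of a user (dense vector over topics).
def user_topic_vec(bow, k=2):
    vec = np.zeros(k)
    for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic] = prob
    return vec

# Topic representation of a product: its column of the topic-word matrix.
topic_word = lda.get_topics()  # shape: (num_topics, vocab_size)

u_vec = user_topic_vec(corpus[0])
p_vec = topic_word[:, dictionary.token2id["p2"]]
score = float(np.dot(u_vec, p_vec))  # the <u, p> score feature
```
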
### Classifier

I constructed both LightGBM and XGBoost models.
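
A minimal LightGBM sketch of this setup (the parameter values are illustrative, not the tuned ones from this repo):

```python
import lightgbm as lgb
import numpy as np

# Toy stand-ins for the real feature matrix; the columns would be the
# manual + automatic features described above.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)

params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "learning_rate": 0.05,
    "num_leaves": 63,
}
model = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=200)
preds = model.predict(X)  # P(u, p) scores in [0, 1]
```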

### Optimization

I used Bayesian optimization to tune my LightGBM model.
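
A minimal sketch of the idea using the `bayes_opt` package (the actual search space and objective in `bayes_optim_lgb.py` may differ):

```python
import lightgbm as lgb
import numpy as np
from bayes_opt import BayesianOptimization

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)

def lgb_cv_score(num_leaves, learning_rate, feature_fraction):
    params = {
        "objective": "binary",
        "metric": "binary_logloss",
        "num_leaves": int(num_leaves),
        "learning_rate": learning_rate,
        "feature_fraction": feature_fraction,
        "verbosity": -1,
    }
    cv = lgb.cv(params, lgb.Dataset(X, label=y), num_boost_round=100, nfold=5)
    # The result key name varies across LightGBM versions, so look it up.
    key = next(k for k in cv if k.endswith("binary_logloss-mean"))
    return -min(cv[key])  # bayes_opt maximizes, so negate the log-loss

optimizer = BayesianOptimization(
    f=lgb_cv_score,
    pbounds={
        "num_leaves": (31, 255),
        "learning_rate": (0.01, 0.2),
        "feature_fraction": (0.5, 1.0),
    },
    random_state=0,
)
optimizer.maximize(init_points=5, n_iter=20)
print(optimizer.max)  # best score and the parameters that achieved it
```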

### Post-classification

I used [this script](https://www.kaggle.com/tarobxl/parallel-version-of-faron-s-script/) to construct orders from the <u, p> pairs. Thanks to Faron, shing, and tarbox!
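
The linked script maximizes the exact expected F1 per user. As a much cruder stand-in (an independence-based approximation, not the script's algorithm), one can pick the top-k cutoff that maximizes an estimate of expected F1:

```python
import numpy as np

def select_products(probs):
    """Pick top-k products by a rough expected-F1 estimate.

    `probs` are the P(u, p) scores for one user's candidate products.
    Assumes independence; only approximates the exact expected-F1
    maximization performed by the linked script.
    """
    order = np.argsort(probs)[::-1]
    p_sorted = np.asarray(probs)[order]
    total = p_sorted.sum()  # expected number of true reorders
    if total == 0:
        return order[:0]
    best_k, best_f1 = 0, 0.0
    for k in range(1, len(p_sorted) + 1):
        precision = p_sorted[:k].mean()
        recall = p_sorted[:k].sum() / total
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return order[:best_k]

basket = select_products([0.9, 0.6, 0.3, 0.05])  # indices of chosen products
```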

### Ensemble

I trained big models (500+ features), medium models (260+ features), and small models (80+ features). Final submissions were generated by bagging the top models using the median.
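
A minimal sketch of median bagging (the file names are hypothetical):

```python
import numpy as np
import pandas as pd

# Per-model prediction files, each with identical (user_id, product_id)
# rows and a `pred` probability column -- hypothetical names.
files = ["big_model.csv", "medium_model.csv", "small_model.csv"]
preds = [pd.read_csv(f)["pred"].to_numpy() for f in files]

# Bag by taking the element-wise median across models.
bagged = np.median(np.vstack(preds), axis=0)
```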

## Files

Python files

> `bayes_optim_lgb.py`
> * LightGBM model tuned by Bayesian optimization
> `lgb_cv.py`
> * LightGBM model with 5-fold cross-validation
> `xgb_train_eval_test.py`
> * XGBoost model
> `transactions.py`
> * craft features manually from the raw transaction log / user purchase history
> `feats.py`
> * combine all features and build the train/test datasets
> `inference.py`
> * construct orders using P(u, p)
> `evaluation.py`
> * functions related to local evaluation
> `constants.py`
> * constants such as file paths
> `utils.py`
> * utility functions

Jupyter notebooks

> `EDA and Feat Craft`
> * dataset exploration and feature crafting
> `Evaluation and Bagging`
> * local evaluation and model bagging
> `Submission and Bagging`
> * generate submissions

## License

Copyright (c) 2017 Yihong Chen