This project is based on a Kaggle competition. The goal is to develop product recommendations from previous transactions as well as from customer and product metadata. The available metadata ranges from simple attributes, such as garment type and customer age, to text from product descriptions and images of the garments.
The final task is to recommend up to 12 products for each customer to purchase during the 7-day period immediately after the training data ends. Performance is evaluated with Mean Average Precision @ 12 (MAP@12).
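For reference, MAP@12 averages, over all customers, the precision at each rank where a purchased item appears, capped at 12 predictions. A minimal sketch of the metric (function names are illustrative, not from the competition code):

```python
def apk(actual, predicted, k=12):
    """Average precision at k for one customer."""
    predicted = predicted[:k]
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        # Count a hit only the first time an actually-purchased item appears.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(actual), k) if actual else 0.0

def mapk(actuals, predictions, k=12):
    """Mean average precision at k over all customers."""
    return sum(apk(a, p, k) for a, p in zip(actuals, predictions)) / len(actuals)
```

For example, predicting `["a", "x", "b"]` when the customer bought `["a", "b"]` scores (1/1 + 2/3) / 2 ≈ 0.83.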
The dataset can be downloaded here.
- `articles.csv`: detailed metadata for each `article_id` available for purchase
- `customers.csv`: metadata for each `customer_id` in the dataset
- `transactions_train.csv`: the training data, consisting of the purchase log of customers
- `images/`: a folder of images corresponding to each `article_id`
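These files are large, so it pays to downcast column dtypes on load. A minimal sketch of the idea on a small in-memory frame mirroring the `transactions_train.csv` schema (the specific target dtypes are reasonable assumptions, not taken from the notebooks):

```python
import pandas as pd

# Illustrative transactions mirroring the competition schema.
transactions = pd.DataFrame({
    "t_dat": ["2020-09-15", "2020-09-22"],
    "customer_id": ["c1", "c2"],
    "article_id": [108775015, 108775044],
    "price": [0.0169, 0.0254],
    "sales_channel_id": [2, 1],
})

before = transactions.memory_usage(deep=True).sum()

# Downcast: article_id fits in int32, price in float32, the sales
# channel in int8, and dates parse into 8-byte datetime64 values.
transactions["article_id"] = transactions["article_id"].astype("int32")
transactions["price"] = transactions["price"].astype("float32")
transactions["sales_channel_id"] = transactions["sales_channel_id"].astype("int8")
transactions["t_dat"] = pd.to_datetime(transactions["t_dat"])

after = transactions.memory_usage(deep=True).sum()
```

The same idea applies when calling `pd.read_csv` directly via its `dtype` and `parse_dates` arguments.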
However, `articles.csv` contains detailed metadata, such as shape, material, and color information, for each article. This project assumes the information gathered from `articles.csv` is sufficient, so the images from the `images/` folder are not used to train the model.
- This project is carried out in two notebooks.
- 1st notebook: EDA (Exploratory Data Analysis)
- This notebook examines the data, analyzes its content, checks for missing values, studies the data distributions, explores the relations between the various files, and performs visualizations and statistical analysis.
- 2nd notebook: Candidate generation and model
- This notebook prepares the data, reducing the training data needed from 4.5 GB + 512 MB + 117 MB to 788 MB + 17 MB + 11 MB, roughly a 6x memory reduction.
- Generates candidates as negative examples for training the model and for evaluation. For each customer, the generated candidates are the 12 best sellers of the last week, the products most recently purchased by the customer, and the best sellers of the week in which the customer made those purchases.
- Feeds the training data and candidates to LGBMRanker and uses the ranker to output predictions. This achieves a score of 0.2045 and place 1798/2952, better than roughly 40% of the other competitors.