
Instacart Market Basket Analysis 2nd place solution

I built two models, one for predicting reorders and one for predicting None (orders that contain no reordered items). The features I engineered are listed below.

Features

User features

  • How often the user reordered items
  • Time between orders
  • Time of day the user visits
  • Whether the user ordered organic, gluten-free, or Asian items in the past
  • Features based on order sizes
  • How many of the user’s orders contained no previously purchased items
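
A minimal sketch of how a few of the user features above could be computed with pandas, assuming the standard Kaggle Instacart files (orders.csv, order_products__prior.csv) and their usual columns; this is an illustration, not the repository's actual feature code (see py_feature/ for that):

```python
import pandas as pd

# Assumed inputs: the standard Kaggle Instacart files.
orders = pd.read_csv('orders.csv')
prior = pd.read_csv('order_products__prior.csv').merge(
    orders[['order_id', 'user_id', 'order_number']], on='order_id')

user = pd.DataFrame()
# How often the user reordered items
user['reorder_rate'] = prior.groupby('user_id')['reordered'].mean()
# Time between orders
user['mean_days_between_orders'] = (
    orders.groupby('user_id')['days_since_prior_order'].mean())
# Features based on order sizes
order_size = prior.groupby(['user_id', 'order_id']).size()
user['mean_order_size'] = order_size.groupby('user_id').mean()
```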

Item features

  • How often the item is purchased
  • Position in the cart
  • How many users buy it as a "one-shot" item (purchased only once)
  • Stats on the number of items that co-occur with this item
  • Stats on the order streak
  • Probability of being reordered within N orders
  • Distribution of the day of week it is ordered
  • Probability it is reordered after the first order
  • Statistics around the time between orders
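
Continuing the sketch above (same `prior` frame), a few of the item features might look like this; again, an illustration rather than the repository's code:

```python
item = pd.DataFrame()
# How often the item is purchased, and how often it is reordered
item['n_purchases'] = prior.groupby('product_id').size()
item['reorder_rate'] = prior.groupby('product_id')['reordered'].mean()
# Position in the cart
item['mean_cart_position'] = (
    prior.groupby('product_id')['add_to_cart_order'].mean())
# Share of buyers who purchased it exactly once ("one-shot" item)
buys_per_user = prior.groupby(['product_id', 'user_id']).size()
item['one_shot_ratio'] = (buys_per_user == 1).groupby('product_id').mean()
```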

User x Item features

  • Number of orders in which the user purchases the item
  • Days since the user last purchased the item
  • Streak (number of orders in a row the user has purchased the item)
  • Position in the cart
  • Whether the user already ordered the item today
  • Co-occurrence statistics
  • Replacement items
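
And a sketch of a few user x item features, continuing from the frames above:

```python
g = prior.groupby(['user_id', 'product_id'])
ui = pd.DataFrame({
    # Number of orders in which the user purchased the item
    'n_orders': g['order_id'].nunique(),
    # Position in the cart
    'mean_cart_position': g['add_to_cart_order'].mean(),
    # Most recent order containing the item (a building block for
    # streak and days-since-last-purchase features)
    'last_order_number': g['order_number'].max(),
})
```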

Datetime features

  • Counts by day of week
  • Counts by hour
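
These two are simple counts over the `orders` frame from the sketch above:

```python
dow_counts = orders['order_dow'].value_counts().sort_index()
hour_counts = orders['order_hour_of_day'].value_counts().sort_index()
```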

For more detail, please refer to the code.

F1 maximization

Regarding F1 maximization: I hadn't read the paper until Faron published his kernel, but my own F1 maximization is what got me a high score, so let me explain it.

To maximize F1, I generate y_true samples according to the predicted probabilities and evaluate F1 starting from the highest-probability items. For example, say an order has items with predicted probabilities like {A: 0.3, B: 0.5, C: 0.4}. I generate y_true many times (9999 times in my case), which gives samples like [[A,B], [B], [B,C], [C], [B], [None], ...]. The next step is to evaluate the expected F1 of [B], then [B,C], then [B,C,A]. Once we see the estimate peak, we can stop the calculation and move on to the next order. Note that with this method we don't need to check every possible subset, like [A], [A,B], [A,B,C], [B], .... Some may have figured this method out from my comment about "tips to go farther".

However, this method is time-consuming and depends on the random seed, so in the end I used Faron's kernel. Fortunately or not, I got almost the same result with it. Please refer to py_model/pyx_get_best_items.pyx.
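
As a rough illustration of the simulation described above, here is a self-contained sketch (not the repository's Cython implementation in py_model/pyx_get_best_items.pyx): sample y_true from the predicted probabilities, walk through prefixes of the items sorted by descending probability, and stop once the Monte Carlo estimate of expected F1 stops improving. For simplicity it ignores the None label:

```python
import numpy as np

def select_items(probs, n_sim=9999, seed=0):
    """Pick the item subset that maximizes simulated expected F1.

    probs: dict mapping item -> predicted reorder probability.
    Returns (best prefix of items sorted by probability, its expected F1).
    """
    rng = np.random.RandomState(seed)
    items = sorted(probs, key=probs.get, reverse=True)
    p = np.array([probs[i] for i in items])

    # Simulate y_true n_sim times: each item is truly reordered
    # independently with its predicted probability.
    sims = rng.rand(n_sim, len(items)) < p
    n_true = sims.sum(axis=1)

    best_items, best_f1 = [], 0.0
    tp = np.zeros(n_sim)
    for k in range(1, len(items) + 1):
        tp = tp + sims[:, k - 1]   # true positives for the size-k prefix
        # F1 = 2*TP / (|pred| + |true|); |pred| = k here
        f1 = np.mean(2 * tp / (k + n_true))
        if f1 <= best_f1:          # expected F1 peaked -> stop early
            break
        best_items, best_f1 = items[:k], f1
    return best_items, best_f1

# The example from the text:
print(select_items({'A': 0.3, 'B': 0.5, 'C': 0.4}))
```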

How to run

```
cd py_feature
python 901_run_feature.py
python 902_run_concat.py
cd ../py_model
python 999_run.py
```

Requirements

Around 300 GB of RAM is needed (sorry). That said, I confirmed that 0.4073 on the private LB is reachable with only around 60 GB of RAM. If you don't have enough memory and still want a high score, try continuous training using the xgb_model argument of xgb.train.
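
A minimal sketch of what that continuous training could look like; the chunked DMatrix files below are hypothetical, but xgb_model is xgboost's real mechanism for resuming training from an existing booster, so the full training set never has to sit in memory at once:

```python
import xgboost as xgb

params = {'objective': 'binary:logistic', 'eta': 0.1}

# Hypothetical pre-built chunks of the training data.
dtrain1 = xgb.DMatrix('train_part1.buffer')
dtrain2 = xgb.DMatrix('train_part2.buffer')

booster = xgb.train(params, dtrain1, num_boost_round=100)
# Resume from the existing booster instead of starting over.
booster = xgb.train(params, dtrain2, num_boost_round=100, xgb_model=booster)
```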

Python packages:

  • numpy==1.12.1
  • pandas==0.19.2
  • scipy==0.19.0
  • tqdm==4.11.2
  • xgboost==0.6
