DavyMorgan/dps

dps

data process service

Overview

Data processing for recommender-system research usually follows the same steps: load the raw data, filter it, reindex user and item IDs, split it into train/validation/test sets, and save the results, plus negative sampling for training recommendation models. This repo serves as a general tool for these operations.

Running examples

Process a sampled Taobao CTR dataset containing 1M user-item interactions:

cd examples
python taobao_ctr.py

Requirements

numpy
pandas
scipy
absl-py

Components

The best entry point for the following components is a DataFrame whose columns include 'uid', 'iid' and 'ts' (user ID, item ID and timestamp).
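As a minimal sketch of that entry point, the snippet below builds such a DataFrame with plain pandas and checks that the three columns exist. The values and the variable names (`interactions`, `missing`) are made up for illustration and are not part of dps.

```python
import pandas as pd

# Hypothetical toy log: each row is one user-item interaction.
interactions = pd.DataFrame(
    {
        "uid": [10, 10, 42, 42, 7],
        "iid": [500, 501, 500, 777, 501],
        "ts": [1609459200, 1609462800, 1609466400, 1609470000, 1609473600],
    }
)

# Check that the three expected columns are present before going further.
missing = {"uid", "iid", "ts"} - set(interactions.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")
```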

loader

CsvLoader: load a CSV file
CooLoader: load a COO file (sparse matrix in coordinate format)
JsonLoader: load a JSON file

filter

CFFilter: k-core filter
DuplicationFilter: remove duplicated records, keeping only the earliest one
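To make the two filters concrete, here is a rough pandas sketch of the underlying techniques. It is not the dps implementation: the function names `keep_earliest` and `k_core` are hypothetical, and the sketch assumes that "duplicate" means the same (uid, iid) pair and that the k-core condition applies to both users and items.

```python
import pandas as pd

def keep_earliest(df):
    """Drop repeated (uid, iid) pairs, keeping the row with the smallest ts."""
    return df.sort_values("ts").drop_duplicates(subset=["uid", "iid"], keep="first")

def k_core(df, k):
    """Repeatedly drop users and items with fewer than k interactions."""
    while True:
        user_counts = df["uid"].map(df["uid"].value_counts())
        item_counts = df["iid"].map(df["iid"].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        # Removing rows can push other users/items below k, so loop again.
        df = df[keep]

# Hypothetical toy data: user 1 clicked item 10 twice, user 3 appears once.
raw = pd.DataFrame(
    {
        "uid": [1, 1, 1, 2, 2, 3],
        "iid": [10, 10, 11, 10, 11, 10],
        "ts": [5, 3, 4, 1, 2, 6],
    }
)
deduped = keep_earliest(raw)
core = k_core(deduped, k=2)
```

The loop is needed because a single pass is not enough in general: dropping a sparse user can leave an item below the threshold.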

reindexer

Reindexer: reindex uid and iid, starting from 0
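Reindexing to contiguous IDs starting at 0 can be sketched with `pandas.factorize`. The helper name `reindex` and the returned mapping dictionaries are assumptions made for this example, not the Reindexer API.

```python
import pandas as pd

def reindex(df, columns=("uid", "iid")):
    """Map each column's raw IDs to 0..n-1 in order of first appearance."""
    out = df.copy()
    mappings = {}
    for col in columns:
        codes, uniques = pd.factorize(out[col])
        out[col] = codes
        # Keep the raw-ID -> new-ID mapping so results can be traced back.
        mappings[col] = {raw: new for new, raw in enumerate(uniques.tolist())}
    return out, mappings

# Hypothetical raw IDs.
raw = pd.DataFrame({"uid": [10, 42, 10, 7], "iid": [500, 777, 501, 500]})
reindexed, mappings = reindex(raw)
```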

splitter

AbsoluteSplitter: split the dataset with a fixed number of test and validation samples
PercentageSplitter: split the dataset proportionally in chronological order
RandomSplitter: split the dataset randomly
SkewSplitter: split the dataset into biased and unbiased parts, following the related literature (PF, CausE, DICE)
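The proportional chronological split and the random split can be sketched as follows. This is an illustration only: the function names and the `val_frac`/`test_frac` parameters are hypothetical, and the sketch assumes the chronological split is taken over the whole dataset rather than per user, which the README does not specify.

```python
import pandas as pd

def _cut(ordered, val_frac, test_frac):
    """Slice an already ordered frame into train/val/test by proportion."""
    n = len(ordered)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test
    return (
        ordered.iloc[:n_train],
        ordered.iloc[n_train:n_train + n_val],
        ordered.iloc[n_train + n_val:],
    )

def chronological_split(df, val_frac=0.1, test_frac=0.1):
    """Oldest rows go to train, newest rows go to test."""
    ordered = df.sort_values("ts", kind="stable").reset_index(drop=True)
    return _cut(ordered, val_frac, test_frac)

def random_split(df, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows with a fixed seed, then cut by proportion."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    return _cut(shuffled, val_frac, test_frac)

# Hypothetical toy data with ten interactions in scrambled time order.
data = pd.DataFrame(
    {
        "uid": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
        "iid": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
        "ts": [7, 3, 9, 1, 10, 4, 6, 2, 8, 5],
    }
)
train, val, test = chronological_split(data)
r_train, r_val, r_test = random_split(data)
```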

generator

CooGenerator: generate sparse matrix in coordinate format
LilGenerator: generate sparse matrix in list of lists format
DokGenerator: generate sparse matrix in dictionary of keys format
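For orientation, this is roughly how a reindexed interaction DataFrame maps onto a SciPy coordinate-format matrix. It uses `scipy.sparse.coo_matrix` directly rather than the dps generators, and the toy data is made up.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical reindexed interactions (IDs already start at 0).
df = pd.DataFrame({"uid": [0, 0, 1, 2], "iid": [0, 2, 1, 2]})

n_users = int(df["uid"].max()) + 1
n_items = int(df["iid"].max()) + 1

# One entry per interaction: value 1 at position (uid, iid).
coo = sparse.coo_matrix(
    (np.ones(len(df), dtype=np.int8), (df["uid"].to_numpy(), df["iid"].to_numpy())),
    shape=(n_users, n_items),
)
```

Rows correspond to users and columns to items, which is why reindexing to contiguous IDs beforehand keeps the matrix as small as possible.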

transformer

SparseTransformer: convert a sparse matrix from coo format to lil and dok formats
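In SciPy terms, the same conversions are available through `tolil()` and `todok()`. The sketch below shows them on a small made-up matrix, together with the kind of lookup each format is convenient for; it does not use SparseTransformer itself.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 user-item matrix in coordinate format.
coo = sparse.coo_matrix(
    (np.ones(4, dtype=np.int8), ([0, 0, 1, 2], [0, 2, 1, 2])),
    shape=(3, 3),
)

lil = coo.tolil()  # per-row lists of column indices
dok = coo.todok()  # (row, col) -> value lookups

items_of_user_0 = list(lil.rows[0])  # all items user 0 interacted with
clicked = dok[0, 2] != 0             # did user 0 interact with item 2?
not_clicked = dok[0, 1] != 0         # did user 0 interact with item 1?
```

LIL suits "all items of a user" queries, while DOK suits single membership checks, for example when rejecting sampled negatives.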

saver

CsvSaver: save DataFrame to file
CooSaver: save coo matrix to file
JsonSaver: save dict to file
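The three kinds of output can be sketched with standard library and SciPy calls. The on-disk formats dps actually uses are not stated here, so the `.npz` format for the coo matrix and all file names below are assumptions for illustration.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical objects to persist.
df = pd.DataFrame({"uid": [0, 0, 1], "iid": [0, 2, 1], "ts": [1, 2, 3]})
coo = sparse.coo_matrix(
    (np.ones(3, dtype=np.int8), (df["uid"].to_numpy(), df["iid"].to_numpy())),
    shape=(2, 3),
)
stats = {"users": 2, "items": 3}

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "interactions.csv")
    npz_path = os.path.join(tmp, "interactions.npz")
    json_path = os.path.join(tmp, "stats.json")

    df.to_csv(csv_path, index=False)  # DataFrame -> CSV
    sparse.save_npz(npz_path, coo)    # sparse matrix -> .npz (assumed format)
    with open(json_path, "w") as f:   # dict -> JSON
        json.dump(stats, f)

    # Read everything back to confirm the round trip.
    df_back = pd.read_csv(csv_path)
    coo_back = sparse.load_npz(npz_path)
    with open(json_path) as f:
        stats_back = json.load(f)
```

Note that JSON object keys are always strings, so a dict keyed by integer IDs comes back with string keys.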

reporter

CsvReporter: report statistics of the data
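Which statistics CsvReporter reports is not listed here. As an assumption, the sketch below computes the figures commonly quoted for recommendation datasets (user count, item count, interaction count and density); the function name `describe_interactions` is hypothetical.

```python
import pandas as pd

def describe_interactions(df):
    """Return basic dataset statistics as a plain dict."""
    n_users = int(df["uid"].nunique())
    n_items = int(df["iid"].nunique())
    n_interactions = len(df)
    cells = n_users * n_items
    return {
        "users": n_users,
        "items": n_items,
        "interactions": n_interactions,
        # Fraction of the user-item matrix that is filled in.
        "density": n_interactions / cells if cells else 0.0,
    }

# Hypothetical toy data.
df = pd.DataFrame({"uid": [0, 0, 1, 2], "iid": [0, 2, 1, 2]})
report = describe_interactions(df)
```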

sampler

PointSampler: negative sampling for pointwise optimization such as log loss
PairSampler: negative sampling for pairwise optimization such as BPR loss
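The difference between the two sampling styles can be sketched as below. This is a generic uniform rejection sampler, not the dps implementation: all function names, the `n_neg` parameter and the dict-of-sets input are assumptions. It also assumes every user has at least one item they have not interacted with, otherwise the rejection loop would never terminate.

```python
import numpy as np

def sample_negative(rng, positives, n_items):
    """Draw a uniform random item ID that is not in the user's positives."""
    while True:
        j = int(rng.integers(n_items))
        if j not in positives:
            return j

def point_samples(user_pos, n_items, n_neg=1, seed=0):
    """Pointwise: (uid, iid, label) rows, n_neg negatives per positive."""
    rng = np.random.default_rng(seed)
    rows = []
    for u, items in user_pos.items():
        for i in sorted(items):
            rows.append((u, i, 1))
            for _ in range(n_neg):
                rows.append((u, sample_negative(rng, items, n_items), 0))
    return rows

def pair_samples(user_pos, n_items, seed=0):
    """Pairwise: (uid, positive iid, negative iid) triples."""
    rng = np.random.default_rng(seed)
    return [
        (u, i, sample_negative(rng, items, n_items))
        for u, items in user_pos.items()
        for i in sorted(items)
    ]

# Hypothetical positives per user, with 5 items in total.
user_pos = {0: {0, 1}, 1: {2}}
points = point_samples(user_pos, n_items=5, n_neg=2)
pairs = pair_samples(user_pos, n_items=5)
```

Pointwise rows feed a binary classification loss on the label, while pairwise triples feed a ranking loss that scores the positive item above the negative one.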
