Source code of the paper "ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines," in the Proceedings of DBML'24 (ICDE Workshop), May 2024.
- Injects errors into a given dataset using the error-generator library; the error methods and their corresponding error rates are passed as function arguments.
- Gets detections and features from dirty datasets using ED2. The intended feature extraction method can be specified as a function argument.
- The class containing the cleaners used in the cleaner inventory of the RDCL framework. If new cleaners are added to this class, the RDCL class must also be updated by specifying the indices of the new cleaners.
- The class that applies preprocessing steps to the dirty and validation datasets after cleaning, including options for normalization, dropping NaNs, one-hot encoding of categorical features, balancing datasets through upsampling and downsampling, dropping duplicates, and so on.
- The main class of the project, which merges all components of the framework. The framework performs the following steps in order:
  - sampling a batch and its corresponding detections and feature vectors,
  - executing the selected cleaners using `Cleaners.py` to clean the dirty batch,
  - preprocessing the cleaned datasets using `DataPreprocessing.py`,
  - getting a reward (a performance metric) using the predictor network trained on the cleaned dataset.
- Example script that loads the framework parameters, runs an experiment with the RDCL pipeline, and then reports the performance results of RDCL and the baseline cleaners.
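
The four framework steps listed above (sample a batch, clean, preprocess, compute a reward) can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the cleaner functions, `preprocess`, `fit_line`, `reward`, and `run_sketch` are all hypothetical names, the predictor is a one-feature least-squares line rather than the models used in `RDCL.py`, and the cleaner is chosen by a simple epsilon-greedy rule in place of the learned RL policy.

```python
import random
import statistics

def clean_mean(rows):
    """Hypothetical cleaner: replace flagged x values (None) with the batch mean."""
    known = [x for x, _ in rows if x is not None]
    if not known:
        return []
    fill = statistics.mean(known)
    return [(fill if x is None else x, y) for x, y in rows]

def clean_delete(rows):
    """Hypothetical cleaner: drop rows whose x value was flagged as erroneous."""
    return [(x, y) for x, y in rows if x is not None]

CLEANERS = [clean_mean, clean_delete]  # assumed inventory, addressed by index

def preprocess(rows):
    """Stand-in preprocessing: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for row in rows:
        if None not in row and row not in seen:
            seen.add(row)
            out.append(row)
    return out

def fit_line(rows):
    """Stand-in predictor: closed-form least squares for y = slope * x + intercept."""
    if not rows:
        return 0.0, 0.0
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in rows) / var if var else 0.0
    return slope, my - slope * mx

def reward(model, validation):
    """Reward = negative mean squared error on the validation set."""
    slope, intercept = model
    return -statistics.mean([(slope * x + intercept - y) ** 2 for x, y in validation])

def run_sketch(dirty, validation, episodes=50, batch_size=20, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    totals = [0.0] * len(CLEANERS)
    counts = [0] * len(CLEANERS)
    for _ in range(episodes):
        batch = rng.sample(dirty, batch_size)            # 1. sample a batch
        if 0 in counts:                                  # try every cleaner once
            action = counts.index(0)
        elif rng.random() < epsilon:                     # explore
            action = rng.randrange(len(CLEANERS))
        else:                                            # exploit best average so far
            action = max(range(len(CLEANERS)), key=lambda i: totals[i] / counts[i])
        cleaned = CLEANERS[action](batch)                # 2. run the selected cleaner
        prepared = preprocess(cleaned)                   # 3. preprocess
        r = reward(fit_line(prepared), validation)       # 4. reward from the predictor
        totals[action] += r
        counts[action] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]

# Toy data: y = 2x + 1, with every fifth x value flagged as an error (None).
dirty = [(None if i % 5 == 0 else float(i), 2.0 * i + 1.0) for i in range(100)]
validation = [(float(i), 2.0 * i + 1.0) for i in range(100, 120)]
print(run_sketch(dirty, validation))  # average reward per cleaner index
```

On this toy data, deleting flagged rows leaves points that lie exactly on the line, so its average reward stays near zero, while mean imputation distorts the fit and earns a lower reward.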
The project can be extended with new datasets, new cleaners, new predictor models, or even new tasks (e.g. clustering) by defining the proper performance metrics.
- In case of a new dataset, `inject_errors.py` and `ed2-feature_extraction.py` can be run consecutively on the dataset to generate a dirty version of it and the corresponding detections and feature vectors by ED2. An external error detector and feature extractor can also be used instead of ED2.
- New cleaners should be implemented in `Cleaners.py` and then added to the `clean_errors()` method of the RDCL class in `RDCL.py`, in the same format as the other cleaners.
- Currently, the pipeline accepts LogisticRegression, LinearRegression, and an MLP network implemented in TensorFlow as predictor models. New models can be implemented in `RDCL.py`. Almost all essential performance metrics for classification and regression tasks are already available in the pipeline. If a new one is needed, e.g. the Silhouette score, it can be incorporated in the last part of `train_rl_cleaner()` of the RDCL class.
- The loss function can also be tweaked in the `loss_fn()` method of the RDCL class.
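
As a rough illustration of the second point, the sketch below shows one way an index-based cleaner inventory can be extended. It does not reproduce the repository's code: the `Cleaners` class here is a simplified stand-in, the method names and the index assignments are assumptions, and the real `clean_errors()` method in `RDCL.py` may have a different signature.

```python
import statistics

def _fill(column, statistic):
    """Replace flagged values (None) with a statistic of the remaining values."""
    known = [v for v in column if v is not None]
    fill = statistic(known)
    return [fill if v is None else v for v in column]

class Cleaners:
    """Simplified stand-in for a cleaner inventory (hypothetical method names)."""

    @staticmethod
    def impute_mean(column):
        return _fill(column, statistics.mean)

    @staticmethod
    def impute_mode(column):
        return _fill(column, statistics.mode)

    @staticmethod
    def impute_median(column):
        # Newly added cleaner: it also needs an index in clean_errors() below.
        return _fill(column, statistics.median)

def clean_errors(column, action):
    """Dispatch on the cleaner index selected by the agent (assumed layout)."""
    if action == 0:
        return Cleaners.impute_mean(column)
    if action == 1:
        return Cleaners.impute_mode(column)
    if action == 2:  # index registered for the new cleaner
        return Cleaners.impute_median(column)
    raise ValueError(f"unknown cleaner index: {action}")

print(clean_errors([1.0, None, 3.0, 100.0], action=2))
```

The point of the sketch is the two-step change: the cleaner itself is added to the inventory, and the dispatching method is updated with the new index so that the agent's action space stays consistent with the inventory.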
Create a conda environment with the required packages for the project:
conda env create -n rdcl_env -f environment.yml
conda activate rdcl_env
Run the following command in the project directory:
on Windows and Linux:
python setup.py install
on Mac:
python -m pip install .