A plug-and-play algorithm that needs as little human intervention as possible to assess data quality.
Data guides today's decision-making process and sits at the center of modern institutions. But, as the saying GIGO (Garbage In, Garbage Out) warns, bad data may have detrimental consequences for the company that uses it. It is therefore crucial that data be of the best possible quality. However, the process of cleaning data usually relies on deterministic rules, which makes it hard, tedious, and time-consuming. Thus AMIES, along with the company Foyer, proposed a challenge about automating this process. These plug-and-play algorithms are the result of our work during the challenge. As we are among the winners of the challenge, we decided to publish the code and develop it in future work.
Assess data quality is an open-source Python project which currently includes:

- A plug-and-play algorithm with several strategies to detect "bad data" in a given data set.
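The detection strategies themselves are not detailed here; as an illustration only, a minimal sketch of what one such check might look like (the function name, thresholds, and IQR-based strategy below are hypothetical, not the project's actual API):

```python
import pandas as pd

def flag_bad_rows(df: pd.DataFrame) -> pd.Series:
    """Flag rows containing a missing value or an IQR outlier in any
    numeric column. Hypothetical strategy, not the project's API."""
    bad = df.isna().any(axis=1)          # rows with at least one missing value
    numeric = df.select_dtypes("number")
    if not numeric.empty:
        q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
        iqr = q3 - q1
        # Tukey's rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
        outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
        bad |= outlier.any(axis=1)
    return bad

df = pd.DataFrame({"price": [100.0, 102.0, 98.0, 101.0, 10_000.0],
                   "units": [1, 2, None, 1, 2]})
print(flag_bad_rows(df).tolist())   # rows 2 (missing) and 4 (outlier) flagged
```

A real plug-and-play implementation would combine several such strategies and expose them behind a single entry point, so the user only supplies the data set.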
Two different paths are suggested:

- Fork the repository: this will allow you to interact with the original repository (raise issues, get updates, propose pull requests, etc.), since you will share a common history.

- Clone the repository: make sure you have git installed on your computer, then run

      cd <directory-of-your-choice>
      git clone https://github.com/<github-user-name>/assess_data_quality.git
A script `test.py` that showcases the result of the algorithm applied to the test dataset `data.csv` is available in the `./test` folder.
The dataset `data.csv` (available in the `./test` folder) is a subset of a dataset of agricultural machinery sales.
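Since the exact columns of `data.csv` are not described here, the snippet below uses an invented stand-in frame (the column names are hypothetical) just to show the kind of cheap, first-pass quality signal such a dataset invites, namely the fraction of missing values per column:

```python
import pandas as pd

# Toy stand-in for data.csv: column names are invented for illustration.
sales = pd.DataFrame({
    "machine_type": ["tractor", "harvester", None, "tractor"],
    "price": [25_000, None, 18_500, 22_000],
})

# Fraction of missing values per column: a first, cheap quality signal.
missing_ratio = sales.isna().mean()
print(missing_ratio.to_dict())   # {'machine_type': 0.25, 'price': 0.25}
```

On the real file one would replace the toy frame with `pd.read_csv("./test/data.csv")` and read the same summary.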