Feature: add data quality tests in Giskard #1601

luca-martial · 2023-11-10T18:17:28Z

🚀 Feature Request

Giskard currently focuses on model quality testing, but since ML models are heavily dependent on the data they are trained on, data quality testing is of high interest. We are looking to implement various data quality tests and are open to community contributions.

Examples of tests to add:

1. Data Completeness Test

Description: Checks for missing values in the dataset.
Implementation Hint: Calculate the percentage of missing values in each column and flag columns exceeding a threshold.

2. Data Uniqueness Test

Description: Ensures no duplicate entries in the dataset.
Implementation Hint: Detect and report duplicate rows or values in specified columns.

3. Data Range and Validity Test

Description: Ensures numerical or categorical data falls within expected ranges or sets of values.
Implementation Hint: Check if data values are within specified ranges or lists of valid values.

4. Data Correlation Test

Description: Analyzes correlations between different features.
Implementation Hint: Calculate and report the correlation matrix of the dataset's features.

5. Data Anomaly Detection Test

Description: Identifies outliers or anomalies in the dataset.
Implementation Hint: Use statistical methods or anomaly detection algorithms to flag significant deviations.

6. Data Integrity Test

Description: Ensures relationships between different data tables or datasets are maintained.
Implementation Hint: Check for foreign key relationships and cross-references.

7. Label Consistency Test

Description: Checks that labels are consistent and correctly assigned.
Implementation Hint: Audit and validate label assignments.

8. Class Imbalance Test

Description: Assesses the distribution of classes in classification problems.
Implementation Hint: Calculate and report the proportion of each class.

9. Feature Importance Test

Description: Evaluates the relevance of each feature to the target variable.
Implementation Hint: Use feature importance scores or coefficients to rank features.

10. Label Noise Detection Test

Description: Detects errors in the labeling of data.
Implementation Hint: Use anomaly detection or clustering to identify mislabeled data points.
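As a rough illustration of the implementation hints above, here is a minimal pandas-only sketch of a few of the listed checks (completeness, uniqueness, range, anomaly detection, class imbalance). The function names, default thresholds, and the IQR rule used for outliers are assumptions made for this example and are not part of Giskard's API.

```python
import pandas as pd


def columns_over_missing_threshold(df: pd.DataFrame, threshold: float = 0.1) -> list:
    """Test 1: names of columns whose fraction of missing values exceeds threshold."""
    ratios = df.isna().mean()  # per-column fraction of NaN/None values
    return ratios[ratios > threshold].index.tolist()


def duplicate_rows(df: pd.DataFrame, subset=None) -> pd.DataFrame:
    """Test 2: rows that repeat an earlier row (optionally on a subset of columns)."""
    return df[df.duplicated(subset=subset, keep="first")]


def out_of_range(df: pd.DataFrame, column: str, low: float, high: float) -> pd.DataFrame:
    """Test 3: rows whose non-missing value lies outside [low, high]."""
    values = df[column]
    # Missing values are left to the completeness check, so exclude them here.
    return df[values.notna() & ~values.between(low, high)]


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Test 5: values further than k * IQR from the quartiles (one simple statistical rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


def class_proportions(df: pd.DataFrame, target: str) -> pd.Series:
    """Test 8: share of each class in the target column."""
    return df[target].value_counts(normalize=True)


if __name__ == "__main__":
    # Hypothetical toy data, only to exercise the helpers.
    df = pd.DataFrame(
        {
            "age": [25, 31, None, 140, 31],
            "label": ["a", "b", "a", "a", "b"],
        }
    )
    print(columns_over_missing_threshold(df, threshold=0.1))
    print(duplicate_rows(df))
    print(out_of_range(df, "age", 0, 120))
    print(iqr_outliers(df["age"]))
    print(class_proportions(df, "label"))
```

Each helper returns the offending columns, rows, or values rather than a boolean, so a test built on top of it can both decide pass/fail against a threshold and report what triggered the failure.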

🔈 Motivation

This will enhance the completeness of Giskard's testing capabilities.

Kranium2002 · 2023-11-28T10:19:40Z

Hi, I would like to work on this issue.

kevinmessiaen · 2023-11-28T13:32:25Z

Hi @Kranium2002

Thanks, I assigned you to the issue. Feel free to ask us if you have any questions!

Kranium2002 · 2023-11-30T14:08:32Z

I have a question: should I work in giskard/utils and create a file there for the data quality tests? The user would pass a pandas DataFrame and the system would then check its quality. How does this sound?

kevinmessiaen · 2023-11-30T14:32:59Z

To make the tests integrate smoothly with Giskard, it's better to use giskard.Dataset rather than pandas.DataFrame. You can access the underlying DataFrame through the df property of the Dataset in your test:

import pandas as pd

# Inside the Giskard code base, use relative imports of the objects you need to avoid circular imports
import giskard


@giskard.test(name="My example data quality test")
def example_quality_test(dataset: giskard.Dataset, column: str, threshold: float = 0.5):
    # Sample test that checks whether the uniqueness ratio is greater than a threshold
    values = dataset.df[column]

    # Ratio of distinct values to total number of rows (1.0 means every value is unique)
    uniqueness = len(values.unique()) / len(values)

    return giskard.TestResult(passed=uniqueness > threshold)


# Trying my test: 4 distinct values out of 6 rows, so uniqueness is about 0.67
dataset = giskard.Dataset(pd.DataFrame({'test': [1, 2, 3, 2, 4, 1]}))

assert example_quality_test(dataset, 'test').execute().passed
assert not example_quality_test(dataset, 'test', 1).execute().passed
assert example_quality_test(dataset, 'test', 0).execute().passed

We organized Giskard so that tests are under giskard.testing.tests.

Kranium2002 · 2023-12-01T11:00:11Z

Working on this in #1651

luca-martial added the enhancement and good first issue labels Nov 10, 2023

kevinmessiaen assigned Kranium2002 Nov 28, 2023

Kranium2002 mentioned this issue Dec 1, 2023

Add data quality tests #1651

Merged

22 tasks

kevinmessiaen closed this as completed Dec 21, 2023
