Feature: add data quality tests in Giskard #1601

Closed
luca-martial opened this issue Nov 10, 2023 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers

@luca-martial
Member

luca-martial commented Nov 10, 2023

🚀 Feature Request

Giskard currently focuses on model quality testing, but since ML models are heavily dependent on the data they are trained on, data quality testing is of high interest. We are looking to implement various data quality tests and are open to community contributions.

Examples of tests to add:

1. Data Completeness Test

  • Description: Checks for missing values in the dataset.
  • Implementation Hint: Calculate the percentage of missing values in each column and flag columns exceeding a threshold.
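
A minimal, dependency-free sketch of the completeness hint above. The helper names and the dict-of-lists input are illustrative assumptions, not Giskard APIs; in a real Giskard test the values would come from `dataset.df`.

```python
import math
from typing import Any, Dict, Sequence


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def flag_incomplete_columns(
    columns: Dict[str, Sequence[Any]], threshold: float = 0.1
) -> Dict[str, float]:
    """Return {column: missing ratio} for columns whose ratio exceeds threshold."""
    flagged = {}
    for name, values in columns.items():
        if len(values) == 0:
            continue  # an empty column has no defined ratio, so it is skipped
        ratio = sum(is_missing(v) for v in values) / len(values)
        if ratio > threshold:
            flagged[name] = ratio
    return flagged


# Hypothetical data: "age" is half empty, "city" is complete.
data = {
    "age": [31, None, 45, None],
    "city": ["Paris", "Lyon", "Nice", "Lille"],
}
flagged = flag_incomplete_columns(data, threshold=0.25)
```

The test would pass when `flagged` is empty and could report the offending columns otherwise.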

2. Data Uniqueness Test

  • Description: Ensures no duplicate entries in the dataset.
  • Implementation Hint: Detect and report duplicate rows or values in specified columns.
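
One possible shape for the duplicate check, sketched with the standard library only. `find_duplicates` and the list-of-dicts row format are hypothetical choices for illustration, and the sketch assumes the compared values are hashable.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple


def find_duplicates(
    rows: List[Dict[str, Any]], subset: Sequence[str]
) -> Dict[Tuple[Any, ...], int]:
    """Count rows sharing the same values in `subset`; keep keys seen more than once."""
    keys = [tuple(row[col] for col in subset) for row in rows]
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: two records share the same email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
duplicates = find_duplicates(rows, ["email"])
```

Passing every column name in `subset` checks for fully duplicated rows; passing a single column checks that column for repeated values.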

3. Data Range and Validity Test

  • Description: Ensures numerical or categorical data falls within expected ranges or sets of values.
  • Implementation Hint: Check if data values are within specified ranges or lists of valid values.
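
A rough sketch of both variants of this check, one for numeric ranges and one for allowed category sets. The function names are made up for illustration, and the sketch assumes missing values have already been handled separately.

```python
from typing import Any, Collection, List, Sequence


def out_of_range(values: Sequence[float], low: float, high: float) -> List[int]:
    """Return positions of values outside the inclusive range [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


def not_in_allowed(values: Sequence[Any], allowed: Collection[Any]) -> List[int]:
    """Return positions of values that are not in the allowed set."""
    return [i for i, v in enumerate(values) if v not in allowed]


# Hypothetical columns with two implausible ages and one misspelled colour.
ages = [25, 40, -3, 130]
colours = ["red", "blue", "grren"]

bad_ages = out_of_range(ages, low=0, high=120)
bad_colours = not_in_allowed(colours, allowed={"red", "green", "blue"})
```

Returning positions rather than a boolean makes it easy to report which rows failed.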

4. Data Correlation Test

  • Description: Analyzes correlations between different features.
  • Implementation Hint: Calculate and report the correlation matrix of the dataset's features.
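
As an illustration of the hint, here is a small Pearson correlation matrix written without external libraries. Pearson is only one choice of coefficient, the helper names are hypothetical, and the sketch assumes all columns are numeric and of equal length.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return float("nan")
    return cov / (spread_x * spread_y)


def correlation_matrix(
    columns: Dict[str, Sequence[float]]
) -> Dict[str, Dict[str, float]]:
    """Return a nested dict with the correlation of every pair of columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Hypothetical features: y rises with x, z falls with x.
features = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(features)
```

A test built on this could, for example, fail when the absolute correlation between two features exceeds a user-chosen threshold.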

5. Data Anomaly Detection Test

  • Description: Identifies outliers or anomalies in the dataset.
  • Implementation Hint: Use statistical methods or anomaly detection algorithms to flag significant deviations.
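
One simple statistical method that fits this hint is the interquartile range rule (Tukey's fences). The sketch below is only one option among many; the function name and the conventional factor of 1.5 are assumptions for illustration.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], factor: float = 1.5) -> List[int]:
    """Return positions of values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical measurements with one value far from the rest.
measurements = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(measurements)
```

A test could pass when the share of flagged values stays below a threshold supplied by the user.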

6. Data Integrity Test

  • Description: Ensures relationships between different data tables or datasets are maintained.
  • Implementation Hint: Check for foreign key relationships and cross-references.
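
A foreign key check can be reduced to a set difference, as in this sketch. `orphaned_keys` is a hypothetical helper, and the sketch assumes the key values are hashable and sortable.

```python
from typing import Any, Iterable, List


def orphaned_keys(child_keys: Iterable[Any], parent_keys: Iterable[Any]) -> List[Any]:
    """Return child key values that have no matching key in the parent table."""
    parents = set(parent_keys)
    return sorted({key for key in child_keys if key not in parents})


# Hypothetical tables: one order references a customer that does not exist.
customer_ids = [1, 2, 3]
order_customer_ids = [1, 2, 2, 7]

orphans = orphaned_keys(order_customer_ids, customer_ids)
```

An empty result means every reference in the child dataset resolves to a row in the parent dataset.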

7. Label Consistency Test

  • Description: Checks that labels are consistent and correctly assigned.
  • Implementation Hint: Audit and validate label assignments.
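
One narrow but automatable reading of this check is to look for identical feature rows that carry different labels. The sketch below takes that interpretation, which is an assumption and not something the issue specifies; `conflicting_labels` is a hypothetical name.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, Sequence, Set, Tuple


def conflicting_labels(
    features: Sequence[Sequence[Hashable]], labels: Sequence[Any]
) -> Dict[Tuple[Hashable, ...], Set[Any]]:
    """Map each feature row that appears with more than one label to its labels."""
    seen: Dict[Tuple[Hashable, ...], Set[Any]] = defaultdict(set)
    for row, label in zip(features, labels):
        seen[tuple(row)].add(label)
    return {row: found for row, found in seen.items() if len(found) > 1}


# Hypothetical data: the same features are labeled both "apple" and "cherry".
feature_rows = [("red", "round"), ("red", "round"), ("green", "long")]
label_column = ["apple", "cherry", "cucumber"]

conflicts = conflicting_labels(feature_rows, label_column)
```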

8. Class Imbalance Test

  • Description: Assesses the distribution of classes in classification problems.
  • Implementation Hint: Calculate and report the proportion of each class.
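
The proportion calculation from the hint could look roughly like this. The helper names and the `min_share` parameter are illustrative assumptions.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence


def class_proportions(labels: Sequence[Any]) -> Dict[Any, float]:
    """Return the share of each class among the labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def underrepresented(labels: Sequence[Any], min_share: float) -> List[Any]:
    """Return classes whose share of the dataset is below min_share."""
    shares = class_proportions(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


# Hypothetical target column with an 80/20 split.
target = ["cat"] * 8 + ["dog"] * 2

proportions = class_proportions(target)
rare = underrepresented(target, min_share=0.25)
```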

9. Feature Importance Test

  • Description: Evaluates the relevance of each feature to the target variable.
  • Implementation Hint: Use feature importance scores or coefficients to rank features.

10. Label Noise Detection Test

  • Description: Detects errors in the labeling of data.
  • Implementation Hint: Use anomaly detection or clustering to identify mislabeled data points.

🔈 Motivation

This will enhance the completeness of Giskard's testing capabilities.

@luca-martial luca-martial added enhancement New feature or request good first issue Good for newcomers labels Nov 10, 2023
@Kranium2002
Contributor

Hi, I would like to work on this issue.

@kevinmessiaen
Member

Hi @Kranium2002

Thanks, I assigned you to the issue. Feel free to ask us if you have any questions!

@Kranium2002
Contributor

I have a question: should I work in giskard/utils and create a file there for user-facing data quality tests? The user would pass a pandas DataFrame and the system would then check its quality. How does this sound?

@kevinmessiaen
Member

For the tests to integrate smoothly with Giskard, it's better to use giskard.Dataset rather than pandas.DataFrame. You can access the underlying DataFrame through the df property of the Dataset in your test:

import pandas as pd

# Inside the Giskard codebase, use relative imports of the objects you need
# to prevent circular import issues
import giskard


@giskard.test(name="My example data quality test")
def example_quality_test(dataset: giskard.Dataset, column: str, threshold: float = 0.5):
    # Sample test that checks whether the uniqueness ratio is greater than a threshold
    values = dataset.df[column]

    # Share of distinct values among all values in the column
    uniqueness = len(values.unique()) / len(values)

    return giskard.TestResult(passed=uniqueness > threshold)


# Trying my test: 4 distinct values out of 6 gives a ratio of about 0.67
dataset = giskard.Dataset(pd.DataFrame({'test': [1, 2, 3, 2, 4, 1]}))

assert example_quality_test(dataset, 'test').execute().passed
assert not example_quality_test(dataset, 'test', 1).execute().passed
assert example_quality_test(dataset, 'test', 0).execute().passed

We organized Giskard so that tests are under giskard.testing.tests.

@Kranium2002 Kranium2002 mentioned this issue Dec 1, 2023
22 tasks
@Kranium2002
Contributor

Working on this in #1651


3 participants