The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Updated Sep 30, 2024 - Python
The open-source tool for building high-quality datasets and computer vision models
A light-weight, flexible, and expressive statistical data testing library
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Prepping tables for machine learning
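An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model. (General-purpose base model, GPU deployment, data cleaning) Tribute to: LLaMA, MOSS, BELLE, Ziya, vLLM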
Easy-to-use Python library of customized functions for cleaning and analyzing data.
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
Pydantic extension for annotating autocorrecting fields.
LLM-based text extraction from unstructured data such as PDF, Word, and HTML files. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
🗺️ Data Cleaning and Textual Data Visualization 🗺️
A full data warehouse infrastructure with ETL pipelines running inside Docker: Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
Cleans Reddit Text Data 📜 🧹
🚢 Data Toolkit for Sailor Language Models
A Machine Learning System for Data Enrichment.