Huggingface Datasets is an excellent library, but it lacks standardization, and datasets often require preprocessing work to be used interchangeably. tasksource streamlines interchangeable dataset usage to scale up evaluation or multi-task learning.
Each dataset is standardized to a `MultipleChoice`, `Classification`, or `TokenClassification` template with canonical fields. We focus on discriminative tasks (i.e., tasks with negative examples or classes) for our annotations, but we also provide a `SequenceToSequence` template. All implemented preprocessings are in tasks.py or tasks.md. A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
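As an illustration, here is a minimal sketch of such a preprocessing in the tasksource style, assuming the `Classification` template accepts sentence and label column mappings analogous to the `MultipleChoice` example shown further below:

```python
from tasksource import Classification

# Sketch: map GLUE/RTE columns to the canonical Classification fields.
# Assumption: Classification takes sentence1/sentence2/labels mappings,
# analogous to the MultipleChoice template shown later in this README.
rte = Classification(
    sentence1="sentence1", sentence2="sentence2", labels="label",
    dataset_name="glue", config_name="rte",
)
dataset = rte.load()  # standardized, interchangeable with other Classification tasks
```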
```
pip install tasksource
```
```python
from tasksource import list_tasks, load_task

df = list_tasks(multilingual=False)  # takes some time
for id in df[df.task_type == "MultipleChoice"].id:
    dataset = load_task(id)  # all yielded datasets can be used interchangeably
```
Browse the 500+ curated tasks in tasks.md (200+ `MultipleChoice` tasks, 200+ `Classification` tasks), and feel free to request a new task. Datasets are downloaded to `$HF_DATASETS_CACHE` (like any Hugging Face dataset), so ensure you have more than 100 GB of space available.
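If your default cache disk is too small, you can redirect the cache; a minimal sketch, assuming the environment variable is set before `datasets` (or `tasksource`) is imported:

```python
import os

# Point the Hugging Face datasets cache at a disk with enough free space.
# Assumption: this must run before importing `datasets`/`tasksource`,
# which read HF_DATASETS_CACHE at import time.
os.environ["HF_DATASETS_CACHE"] = "/path/with/space"
```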
You can now also use:

```python
from datasets import load_dataset

load_dataset("tasksource/data", "glue/rte", max_rows=30_000)
```
A text encoder pretrained on tasksource reached state-of-the-art results: 🤗/deberta-v3-base-tasksource-nli. Tasksource pretraining is notably helpful for RLHF reward modeling and for classification in general, including zero-shot. Large and multilingual versions are also available.
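For instance, zero-shot classification works with the standard transformers pipeline; a sketch, assuming the model linked above lives at the hub id `sileod/deberta-v3-base-tasksource-nli`:

```python
from transformers import pipeline

# Zero-shot classification via the NLI-pretrained tasksource encoder.
# Assumption: hub id sileod/deberta-v3-base-tasksource-nli.
classifier = pipeline("zero-shot-classification",
                      model="sileod/deberta-v3-base-tasksource-nli")
classifier("One day I will see the world.",
           candidate_labels=["travel", "cooking", "dancing"])
```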
The repo also contains recasting code to convert tasksource datasets into instructions, yielding one of the richest instruction-tuning datasets: 🤗/tasksource-instruct-v0
We also recast all classification tasks as natural language inference, to improve entailment-based zero-shot classification: 🤗/zero-shot-label-nli
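Both recast datasets can be loaded directly; a sketch, assuming the hub ids match the links above:

```python
from datasets import load_dataset

# Assumption: hub ids inferred from the dataset links above.
instruct = load_dataset("tasksource/tasksource-instruct-v0")
label_nli = load_dataset("tasksource/zero-shot-label-nli")
```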
```python
from tasksource import MultipleChoice

# 'question_propmt' is not a typo here: the CODAH source dataset itself
# misspells this column name.
codah = MultipleChoice('question_propmt', choices_list='candidate_answers',
                       labels='correct_answer_idx',
                       dataset_name='codah', config_name='codah')

winogrande = MultipleChoice('sentence', ['option1', 'option2'], 'answer',
                            dataset_name='winogrande', config_name='winogrande_xl',
                            splits=['train', 'validation', None])  # test labels are not usable

tasks = [winogrande.load(), codah.load()]  # aligned datasets (same columns) can be used interchangeably
```
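Since aligned datasets share the same columns, they can be merged for multi-task training; a minimal sketch, assuming each `.load()` returns a Hugging Face `DatasetDict`:

```python
from datasets import concatenate_datasets

# Assumption: each task.load() yields a DatasetDict with a "train" split.
multitask_train = concatenate_datasets([task["train"] for task in tasks])
print(multitask_train)  # one pool of aligned multiple-choice examples
```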
For more details, refer to this article:
```bibtex
@inproceedings{sileo-2024-tasksource,
    title = "tasksource: A Large Collection of {NLP} tasks with a Structured Dataset Preprocessing Framework",
    author = "Sileo, Damien",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1361",
    pages = "15655--15684",
}
```
For help integrating tasksource into your experiments, please contact [email protected].