BigScience Evaluation

Code and data for the BigScience Evaluation WG.

Upcoming Milestones for Contributors

September 1, 2021: Eval Engineering Subgroup releases toy tasks/dummy code to define the API
September 1, 2021: New task-based subgroups are established and begin work
October 1, 2021: Finalize GitHub with all data and scripts for generating raw evaluation results
October 15, 2021: General meeting to discuss longer research project proposals for fall/spring
October 15, 2021: Form subgroup on data presentation/visualization to create final report card

Quickstart

To benchmark a baseline GPT-2 model on the WMT and TyDiQA datasets using a GPU, run

python3 -m evaluation.eval \
    --model_name_or_path gpt2 \
    --eval_tasks wmt tydiqa_secondary \
    --device cuda \
    --output_dir outputs

Note: For the toxicity dataset, you have to download the data manually from Kaggle and pass the folder that contains it via the data_dir argument.
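
To illustrate how the flags in the Quickstart command fit together, here is a minimal sketch that parses them with argparse. It is an assumption that the repository uses argparse at all; the defaults and the exact spelling of the `--data_dir` flag are also assumptions made for this example. The point is only that `--eval_tasks` accepts several task names after a single flag.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the Quickstart command."""
    parser = argparse.ArgumentParser(prog="evaluation.eval")
    parser.add_argument("--model_name_or_path", required=True)
    # nargs="+" lets one flag carry several task names, e.g. "wmt tydiqa_secondary".
    parser.add_argument("--eval_tasks", nargs="+", required=True)
    parser.add_argument("--device", default="cpu")  # default is an assumption
    parser.add_argument("--output_dir", default="outputs")  # default is an assumption
    # Assumed flag name; only needed for manually downloaded datasets.
    parser.add_argument("--data_dir", default=None)
    return parser


# Parse the same arguments as the Quickstart command instead of reading sys.argv.
args = build_parser().parse_args(
    [
        "--model_name_or_path", "gpt2",
        "--eval_tasks", "wmt", "tydiqa_secondary",
        "--device", "cuda",
        "--output_dir", "outputs",
    ]
)
print(args.eval_tasks)  # ['wmt', 'tydiqa_secondary']
```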

Setup

Create a virtual environment (one-time).

python3 -m venv venv # create a virtual environment called 'venv'

Activate the virtual environment.
```
source venv/bin/activate
```

Install package requirements.

python3 -m pip install -r requirements.txt
python3 -m pip install -r requirements-dev.txt

Tasks

This project plans to support all datasets listed under docs/datasets.md. The sections below describe the task-independent inner workings of this repository.

AutoTask

Every task/dataset lives as a submodule within evaluation.tasks. The core class of each submodule inherits from evaluation.tasks.auto_task.AutoTask, a base class that declares all abstract functions and holds model, tokenizer, and task_config as its attributes.
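
The pattern described above can be sketched as follows. This is not the repository's code: `AutoTaskSketch`, `ToyLengthTask`, and `get_display_name` are hypothetical names chosen for illustration, storing `device` and `english_only` on the instance is an assumption, and the tokenizer is replaced by `str.split` so the example runs without any model.

```python
from abc import ABC, abstractmethod


class AutoTaskSketch(ABC):
    """Illustrative stand-in for a task base class (hypothetical, not the real AutoTask)."""

    def __init__(self, model, tokenizer, device, english_only, task_config=None):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device  # assumption: kept on the instance
        self.english_only = english_only  # assumption: kept on the instance
        self.task_config = task_config or {}
        self.metrics = {}

    @staticmethod
    @abstractmethod
    def get_display_name() -> str:
        """Name used to look the task up (hypothetical helper)."""

    @abstractmethod
    def evaluate(self) -> None:
        """Run the task and fill self.metrics."""

    @classmethod
    def from_task_name(cls, task_name, model, tokenizer, device, english_only):
        # Only direct subclasses are searched; enough for this sketch.
        for subclass in cls.__subclasses__():
            if subclass.get_display_name() == task_name:
                return subclass(model, tokenizer, device, english_only)
        raise ValueError(f"Unknown task: {task_name}")


class ToyLengthTask(AutoTaskSketch):
    """Toy task that counts tokens in a fixed sentence."""

    @staticmethod
    def get_display_name() -> str:
        return "toy_length"

    def evaluate(self) -> None:
        tokens = self.tokenizer("hello big science")
        self.metrics = {"num_tokens": len(tokens)}


task = AutoTaskSketch.from_task_name(
    "toy_length", model=None, tokenizer=str.split, device="cpu", english_only=True
)
task.evaluate()
print(task.metrics)  # {'num_tokens': 3}
```

Because the base class declares its methods as abstract, instantiating it directly raises a `TypeError`; only concrete task subclasses can be created.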

AutoTask provides a single entry point for loading any benchmark task by name. The basic signature is

task = AutoTask.from_task_name(
    "task_name", model, tokenizer, device, english_only
)

Alternatively, if the model has to be recreated for each task, a task object can be created from string specifications.

task = AutoTask.from_spec(
    "task_name",
    "model_name_or_path",
    "tokenizer_name",
    device,
    english_only,
    data_dir,  # optional
)
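
As a rough sketch of why a string-based constructor helps, the example below builds a fresh model for every task from its name instead of sharing one instance. `SpecTask`, `load_model`, and `load_tokenizer` are hypothetical stand-ins (the loaders return dummy objects rather than real models), so this shows the shape of the idea under those assumptions, not how the repository implements it.

```python
def load_model(model_name_or_path):
    """Hypothetical loader; returns a dummy object instead of a real model."""
    return {"name": model_name_or_path}


def load_tokenizer(tokenizer_name):
    """Hypothetical loader; whitespace splitting stands in for a tokenizer."""
    return str.split


class SpecTask:
    """Illustrative task container (not the repository's AutoTask)."""

    def __init__(self, task_name, model, tokenizer, device, english_only, data_dir=None):
        self.task_name = task_name
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.english_only = english_only
        self.data_dir = data_dir

    @classmethod
    def from_spec(cls, task_name, model_name_or_path, tokenizer_name,
                  device, english_only, data_dir=None):
        # The model and tokenizer are created here, once per task.
        model = load_model(model_name_or_path)
        tokenizer = load_tokenizer(tokenizer_name)
        return cls(task_name, model, tokenizer, device, english_only, data_dir)


# Each task receives its own model object, so no state leaks between tasks.
tasks = [
    SpecTask.from_spec(name, "gpt2", "gpt2", "cpu", True)
    for name in ["wmt", "tydiqa_secondary"]
]
print(tasks[0].model is tasks[1].model)  # False
```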

Evaluation

Every AutoTask subclass implements an .evaluate() function that contains all evaluation logic: loading the dataset (and the dataloader, if necessary) and computing the reported metrics. When evaluation finishes, the metrics are stored in task.metrics. For more details on the full pipeline, refer to the main evaluation script, evaluation/eval.py.
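
The evaluate-then-read-metrics flow can be sketched as below. `ToyTask` and `run_tasks` are hypothetical, and writing one JSON file per task into the output directory is an assumption made for illustration; evaluation/eval.py remains the reference for what the pipeline actually does.

```python
import json
import tempfile
from pathlib import Path


class ToyTask:
    """Stand-in for a task object; only evaluate() and metrics matter here."""

    def __init__(self, name, predictions, references):
        self.name = name
        self.predictions = predictions
        self.references = references
        self.metrics = {}

    def evaluate(self):
        # Exact-match accuracy over paired predictions and references.
        correct = sum(p == r for p, r in zip(self.predictions, self.references))
        self.metrics = {"accuracy": correct / len(self.references)}


def run_tasks(tasks, output_dir):
    """Evaluate each task, collect its metrics, and save them as JSON."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for task in tasks:
        task.evaluate()
        results[task.name] = task.metrics
        out_file = output_dir / f"{task.name}.json"  # assumed file layout
        out_file.write_text(json.dumps(task.metrics, indent=2))
    return results


with tempfile.TemporaryDirectory() as tmp:
    toy = ToyTask("toy", ["a", "b", "c", "d"], ["a", "b", "x", "d"])
    results = run_tasks([toy], tmp)
    saved = json.loads((Path(tmp) / "toy.json").read_text())

print(results)  # {'toy': {'accuracy': 0.75}}
```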

Contributing

Refer to CONTRIBUTING.md.
