
Commit

Merge branch 'main' into chat_template
KonradSzafer committed May 22, 2024
2 parents 8a0ce59 + 70e1de0 commit 9bd948d
Showing 84 changed files with 1,063 additions and 24 deletions.
8 changes: 4 additions & 4 deletions docs/README.md
@@ -4,7 +4,7 @@ Welcome to the docs for the LM Evaluation Harness!

## Table of Contents

* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/interface.md)
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](https://github.com/EleutherAI/lm-evaluation-harness/blob/big-refactor/docs/task_guide.md).
* To learn about the public interface of the library, as well as how to evaluate via the commandline or as integrated into an external library, see the [Interface](./interface.md)
* To learn how to add a new library, API, or model type to the library, as well as a quick explainer on the types of ways to evaluate an LM, see the [Model Guide](./model_guide.md).
* For a crash course on adding new tasks to the library, see our [New Task Guide](./new_task_guide.md).
* To learn more about pushing the limits of task configuration that the Eval Harness supports, see the [Task Configuration Guide](./task_guide.md).
13 changes: 10 additions & 3 deletions docs/interface.md
@@ -42,7 +42,7 @@ This mode supports a number of command-line arguments, the details of which can

- `--show_config` : If used, prints the full `lm_eval.api.task.TaskConfig` contents (non-default settings in the task's YAML file) for each task that was run, at the completion of an evaluation. Useful when one is modifying a task's configuration YAML locally and wants to share the exact configuration used, for debugging or reproducibility purposes.

- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing ` lm-eval`` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`
- `--include_path` : Accepts a path to a folder. If passed, then all YAML files containing `lm-eval` compatible task configurations will be added to the task registry as available tasks. Used for when one is writing config files for their own task in a folder other than `lm_eval/tasks/`.

- `--system_instruction`: Specifies a system instruction string to prepend to the prompt.

@@ -54,7 +54,14 @@ This mode supports a number of command-line arguments, the details of which can

* `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all three seeds to 42.

* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list (here.)[https://docs.wandb.ai/ref/python/init]. e.g., ```--wandb_args project=test-project,name=test-run```
* `--wandb_args`: Tracks logging to Weights and Biases for evaluation runs and includes args passed to `wandb.init`, such as `project` and `job_type`. Full list [here](https://docs.wandb.ai/ref/python/init). e.g., ```--wandb_args project=test-project,name=test-run```

* `--hf_hub_log_args` : Logs evaluation results to the Hugging Face Hub. Accepts a string with the arguments separated by commas (an example invocation follows this list). Available arguments:
  * `hub_results_org` - organization name on the Hugging Face Hub, e.g., `EleutherAI`
  * `hub_repo_name` - repository name on the Hugging Face Hub, e.g., `lm-eval-results`
  * `push_results_to_hub` - whether to push results to the Hugging Face Hub; can be `True` or `False`
  * `push_samples_to_hub` - whether to push per-sample results to the Hugging Face Hub; can be `True` or `False`. Requires `--log_samples` to be set
  * `public_repo` - whether the repository is public; can be `True` or `False`
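
For example, a hypothetical invocation (assuming the same comma-separated `key=value` format as `--wandb_args`): ```--hf_hub_log_args hub_results_org=EleutherAI,hub_repo_name=lm-eval-results,push_results_to_hub=True,public_repo=False```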

## External Library Usage

@@ -83,7 +90,7 @@ task_manager = lm_eval.tasks.TaskManager()

# Setting `task_manager` to the one above is optional and should generally be done
# if you want to include tasks from paths other than ones in `lm_eval/tasks`.
# `simple_evaluate` will instantiate its own task_manager is the it is set to None here.
# `simple_evaluate` will instantiate its own task_manager if it is set to None here.
results = lm_eval.simple_evaluate( # call simple_evaluate
model=lm_obj,
tasks=["taskname1", "taskname2"],
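
For readers of this diff, a minimal end-to-end sketch of the external-library usage shown above (model name and task list are illustrative, and exact argument names may vary by version; this is a sketch, not the canonical example from the docs):

```python
import lm_eval

# Instantiate an LM object; here via the Hugging Face backend (model name is illustrative).
lm_obj = lm_eval.models.huggingface.HFLM(pretrained="EleutherAI/pythia-160m", batch_size=8)

# Optional: also index task YAMLs from a custom directory, in addition to lm_eval/tasks.
task_manager = lm_eval.tasks.TaskManager(include_path="path/to/my/custom/tasks")

results = lm_eval.simple_evaluate(
    model=lm_obj,
    tasks=["lambada_openai"],
    num_fewshot=0,
    task_manager=task_manager,  # if left as None, simple_evaluate instantiates its own
)
```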
2 changes: 1 addition & 1 deletion docs/model_guide.md
@@ -6,7 +6,7 @@ In order to properly evaluate a given LM, we require implementation of a wrapper

## Setup

To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your task, and install the project requirements in your environment:
To get started contributing, go ahead and fork the main repo, clone it, create a branch with the name of your model, and install the project requirements in your environment:

```sh
# After forking...
6 changes: 2 additions & 4 deletions docs/new_task_guide.md
@@ -172,7 +172,7 @@ doc_to_target: "{{answer}}"
```


**Important**: we now add `target_delimiter` between input and target which defaults to " ", such that the full input-output string is `doc_to_target(doc) + target_delimiter + doc_to_text(doc)`. doc_to_text and doc_to_target should not contain trailing right or left whitespace, respectively.
**Important**: we now add `target_delimiter` between input and target, which defaults to `" "`, such that the full input-output string is `doc_to_text(doc) + target_delimiter + doc_to_target(doc)`. `doc_to_text` and `doc_to_target` should not contain trailing right or left whitespace, respectively.
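
To make the concatenation concrete, a tiny hand-worked sketch (the question and answer are hypothetical, not drawn from any task):

```python
# Hypothetical rendered pieces for a single document:
doc_to_text = "Question: What is 2 + 2?\nAnswer:"  # no trailing whitespace
doc_to_target = "4"                                 # no leading whitespace
target_delimiter = " "                              # the default

full_string = doc_to_text + target_delimiter + doc_to_target
# full_string == "Question: What is 2 + 2?\nAnswer: 4"
```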


#### Multiple choice format
@@ -366,9 +366,7 @@ task:

## Beautifying Table Display

To avoid conflict, each task needs to be registered with a unique name. Because of this, slight variations of task are still counted as unique tasks and need to be named uniquely. This could be done by appending an additional naming that may refer to the variation such as in MMLU where the template used to evaluated for flan are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of the evaluation especially when you have a long list of tasks or are using a benchmark that comprises of many tasks. To make it more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name that will be printed.
``
for example in `mmlu_abstract_algebra.yaml` we set `group_alias` to `stem` and `task_alias` to `abstract_algebra`.
To avoid conflicts, each task needs to be registered with a unique name. Because of this, slight variations of a task are still counted as unique tasks and need unique names. This can be done by appending an additional identifier for the variation, as in MMLU, where the templates used for the Flan-style evaluation are differentiated from the default by the prefix `mmlu_flan_*`. Printing the full task names can easily clutter the results table at the end of the evaluation, especially when you have a long list of tasks or are using a benchmark that comprises many tasks. To make it more legible, you can use `task_alias` and `group_alias` to provide an alternative task name and group name to be printed. For example, in `mmlu_abstract_algebra.yaml` we set `group_alias` to `stem` and `task_alias` to `abstract_algebra`.

```
"dataset_name": "abstract_algebra"
4 changes: 2 additions & 2 deletions docs/task_guide.md
@@ -31,8 +31,8 @@ Dataset configuration options:
Prompting / in-context formatting options:
- **use_prompt** (`str`, *optional*) — Name of prompt in promptsource to use. If defined, it will overwrite doc_to_text, doc_to_target, and doc_to_choice.
- **description** (`str`, *optional*) — An optional prepended Jinja2 template or string which will be prepended to the few-shot examples passed into the model, often describing the task or providing instructions to a model, such as `"The following are questions (with answers) about {{subject}}.\n\n"`. No delimiters or spacing are inserted between the description and the first few-shot example.
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate input for the model
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate input for the model.
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into the answer choice list of the correct answer.
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Jinja2 template, string, or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.
- **fewshot_delimiter** (`str`, *optional*, defaults to "\n\n") — String to insert between few-shot examples.
- **target_delimiter** (`str`, *optional*, defaults to `" "`) — String to insert between input and target output for the datapoint being tested.
21 changes: 13 additions & 8 deletions lm_eval/models/huggingface.py
@@ -199,6 +199,15 @@ def __init__(
config=self.config, backend=backend, trust_remote_code=trust_remote_code
)

# load tokenizer so we know tokenizer vocabulary size before loading model and PEFT
self._create_tokenizer(
pretrained,
tokenizer,
revision=revision,
trust_remote_code=trust_remote_code,
use_fast_tokenizer=use_fast_tokenizer,
)

# if we passed `pretrained` as a string, initialize our model now
if isinstance(pretrained, str):
self._create_model(
@@ -235,14 +244,6 @@ def __init__(
"Failed to place model onto specified device. This may be because the model is quantized via `bitsandbytes` or `device_map` is provided. If the desired GPU is being used, this message is safe to ignore."
)

self._create_tokenizer(
pretrained,
tokenizer,
revision=revision,
trust_remote_code=trust_remote_code,
use_fast_tokenizer=use_fast_tokenizer,
)

self.truncation = truncation
self.logits_cache = logits_cache
self.vocab_size = self.tokenizer.vocab_size
@@ -579,6 +580,10 @@ def _create_model(
if model_kwargs.get("load_in_4bit", None):
if version.parse(PEFT_VERSION) < version.parse("0.4.0"):
raise AssertionError("load_in_4bit requires peft >= 0.4.0")
if self._model.config.vocab_size != len(self.tokenizer):
# resize model for LoRAs with added tokens
self._model.resize_token_embeddings(len(self.tokenizer))
eval_logger.info(f"Model config indicates vocab_size='{self._model.config.vocab_size}', but found tokenizer with vocab size '{len(self.tokenizer)}'. Resizing model embedding layer...")
self._model = PeftModel.from_pretrained(
self._model, peft, revision=revision
)
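
For context, the pattern the added lines implement, shown outside the harness (model and adapter names are hypothetical; this is a sketch of the general idea, not the harness's exact code path):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("my-org/base-model")            # hypothetical
tokenizer = AutoTokenizer.from_pretrained("my-org/lora-with-added-tokens")  # hypothetical

# A LoRA trained with added tokens ships a larger tokenizer vocabulary; the base
# model's embedding matrix must be resized to match before the adapter is attached.
if base.config.vocab_size != len(tokenizer):
    base.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(base, "my-org/lora-with-added-tokens")
```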
4 changes: 3 additions & 1 deletion lm_eval/tasks/__init__.py
@@ -413,7 +413,9 @@ def get_task_dict(
)

string_task_name_list = [task for task in task_name_list if isinstance(task, str)]
others_task_name_list = [task for task in task_name_list if ~isinstance(task, str)]
others_task_name_list = [
task for task in task_name_list if not isinstance(task, str)
]
if len(string_task_name_list) > 0:
if task_manager is None:
task_manager = TaskManager()
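
A note on why the replaced comprehension was a bug: `~` is Python's bitwise NOT, not logical negation, so applied to a `bool` it never yields a falsy value:

```python
# bools are ints, so ~True == -2 and ~False == -1 -- both truthy.
items = ["some_task_name", {"custom": "task_object"}]

broken = [x for x in items if ~isinstance(x, str)]    # keeps every element
fixed = [x for x in items if not isinstance(x, str)]  # keeps only the non-string

assert broken == items
assert fixed == [{"custom": "task_object"}]
```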
47 changes: 47 additions & 0 deletions lm_eval/tasks/copal_id/README.md
@@ -0,0 +1,47 @@
# COPAL

### Paper

Title: `COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances`

Abstract: `https://arxiv.org/abs/2311.01012`

`COPAL-ID is an Indonesian causal commonsense reasoning dataset that captures local nuances. It provides a more natural portrayal of day-to-day causal reasoning within the Indonesian (especially Jakartan) cultural sphere. Professionally written and validated from scratch by natives, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID.`

Homepage: `https://github.com/haryoa/copal-id`


### Citation

```
@article{wibowo2023copal,
title={COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances},
author={Wibowo, Haryo Akbarianto and Fuadi, Erland Hilman and Nityasya, Made Nindyatama and Prasojo, Radityo Eko and Aji, Alham Fikri},
journal={arXiv preprint arXiv:2311.01012},
year={2023}
}
```

### Groups and Tasks

#### Groups

* `copal_id`

#### Tasks

* `copal_id_standard`: `Standard version of the COPAL dataset; uses formal language and fewer local nuances`
* `copal_id_colloquial`: `Colloquial version of the COPAL dataset; uses informal language and more local nuances`

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
4 changes: 4 additions & 0 deletions lm_eval/tasks/copal_id/colloquial.yaml
@@ -0,0 +1,4 @@
include: standard.yaml
task: copal_id_colloquial
task_alias: colloquial
test_split: test_colloquial
14 changes: 14 additions & 0 deletions lm_eval/tasks/copal_id/standard.yaml
@@ -0,0 +1,14 @@
group: copal_id
task: copal_id_standard
task_alias: standard
dataset_path: haryoaw/COPAL
dataset_name: id
output_type: multiple_choice
test_split: test
doc_to_text: !function utils.doc_to_text_id
doc_to_target: label
doc_to_choice: !function utils.doc_to_choice
metric_list:
- metric: acc
metadata:
version: 1.0
23 changes: 23 additions & 0 deletions lm_eval/tasks/copal_id/utils.py
@@ -0,0 +1,23 @@
from functools import partial


def convert_choice(choice):
return choice[0].lower() + choice[1:]


def doc_to_text(doc, connector):
conn = connector[doc["question"]]
return doc["premise"].strip()[:-1] + f" {conn}"


def doc_to_choice(doc):
return [convert_choice(doc["choice1"]), convert_choice(doc["choice2"])]


doc_to_text_id = partial(
doc_to_text,
connector={
"cause": "karena",
"effect": "maka",
},
)
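
To show what these helpers produce, a hand-made document in the COPAL-ID style (the Indonesian text is illustrative, not taken from the dataset, and assumes the helpers above are in scope):

```python
doc = {
    "premise": "Dia tidak membawa payung.",  # "He did not bring an umbrella."
    "question": "effect",
    "choice1": "Dia kehujanan di jalan.",
    "choice2": "Dia tiba lebih awal.",
    "label": 0,
}

print(doc_to_text_id(doc))  # -> "Dia tidak membawa payung maka"
print(doc_to_choice(doc))   # -> ["dia kehujanan di jalan.", "dia tiba lebih awal."]
```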
11 changes: 11 additions & 0 deletions lm_eval/tasks/mmlu/continuation/_continuation_template_yaml
@@ -0,0 +1,11 @@
dataset_path: hails/mmlu_no_train # a copy of `cais/mmlu` with no auxiliary_train split
output_type: multiple_choice
test_split: test
fewshot_split: dev
fewshot_config:
sampler: first_n
doc_to_text: "Question: {{question.strip()}}\nAnswer:"
doc_to_choice: "{{choices}}"
doc_to_target: "{{answer}}"
metadata:
version: 0.0
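
For intuition about this new `mmlu_continuation` prompt format: no lettered options appear in the prompt; each answer choice is instead scored as a continuation of it. A hand-rendered sketch with a made-up question (not the harness's internal code):

```python
doc = {
    "question": "What is the characteristic of the ring Z/7Z?",
    "choices": ["0", "3", "7", "12"],
    "answer": 2,
}

prompt = f"Question: {doc['question'].strip()}\nAnswer:"
# The harness scores each choice appended after the default target_delimiter " ":
candidates = [prompt + " " + choice for choice in doc["choices"]]
correct = candidates[doc["answer"]]  # ends with "Answer: 7"
```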
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/_mmlu.yaml
@@ -0,0 +1,6 @@
group: mmlu_continuation
task:
- mmlu_continuation_stem
- mmlu_continuation_other
- mmlu_continuation_social_sciences
- mmlu_continuation_humanities
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_abstract_algebra.yaml
@@ -0,0 +1,6 @@
"dataset_name": "abstract_algebra"
"description": "The following are questions (with answers) about abstract\
\ algebra.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_abstract_algebra"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_anatomy.yaml
@@ -0,0 +1,6 @@
"dataset_name": "anatomy"
"description": "The following are questions (with answers) about anatomy.\n\
\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_anatomy"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_astronomy.yaml
@@ -0,0 +1,6 @@
"dataset_name": "astronomy"
"description": "The following are questions (with answers) about astronomy.\n\
\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_astronomy"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_business_ethics.yaml
@@ -0,0 +1,6 @@
"dataset_name": "business_ethics"
"description": "The following are questions (with answers) about business\
\ ethics.\n\n"
"group": "mmlu_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_business_ethics"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_clinical_knowledge.yaml
@@ -0,0 +1,6 @@
"dataset_name": "clinical_knowledge"
"description": "The following are questions (with answers) about clinical\
\ knowledge.\n\n"
"group": "mmlu_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_clinical_knowledge"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_college_biology.yaml
@@ -0,0 +1,6 @@
"dataset_name": "college_biology"
"description": "The following are questions (with answers) about college\
\ biology.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_college_biology"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_college_chemistry.yaml
@@ -0,0 +1,6 @@
"dataset_name": "college_chemistry"
"description": "The following are questions (with answers) about college\
\ chemistry.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_college_chemistry"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_college_computer_science.yaml
@@ -0,0 +1,6 @@
"dataset_name": "college_computer_science"
"description": "The following are questions (with answers) about college\
\ computer science.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_college_computer_science"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_college_mathematics.yaml
@@ -0,0 +1,6 @@
"dataset_name": "college_mathematics"
"description": "The following are questions (with answers) about college\
\ mathematics.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_college_mathematics"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_college_medicine.yaml
@@ -0,0 +1,6 @@
"dataset_name": "college_medicine"
"description": "The following are questions (with answers) about college\
\ medicine.\n\n"
"group": "mmlu_continuation_other"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_college_medicine"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_college_physics.yaml
@@ -0,0 +1,6 @@
"dataset_name": "college_physics"
"description": "The following are questions (with answers) about college\
\ physics.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_college_physics"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_computer_security.yaml
@@ -0,0 +1,6 @@
"dataset_name": "computer_security"
"description": "The following are questions (with answers) about computer\
\ security.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_computer_security"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_conceptual_physics.yaml
@@ -0,0 +1,6 @@
"dataset_name": "conceptual_physics"
"description": "The following are questions (with answers) about conceptual\
\ physics.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_conceptual_physics"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_econometrics.yaml
@@ -0,0 +1,6 @@
"dataset_name": "econometrics"
"description": "The following are questions (with answers) about econometrics.\n\
\n"
"group": "mmlu_continuation_social_sciences"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_econometrics"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_electrical_engineering.yaml
@@ -0,0 +1,6 @@
"dataset_name": "electrical_engineering"
"description": "The following are questions (with answers) about electrical\
\ engineering.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_electrical_engineering"
6 changes: 6 additions & 0 deletions lm_eval/tasks/mmlu/continuation/mmlu_elementary_mathematics.yaml
@@ -0,0 +1,6 @@
"dataset_name": "elementary_mathematics"
"description": "The following are questions (with answers) about elementary\
\ mathematics.\n\n"
"group": "mmlu_continuation_stem"
"include": "_continuation_template_yaml"
"task": "mmlu_continuation_elementary_mathematics"
