
Sync with upstream #9

Merged
merged 121 commits into main from sync_with_upstream on Jun 25, 2024
Conversation

SpirinEgor
Member

No description provided.

haileyschoelkopf and others added 30 commits March 11, 2024 21:02
* add Arabic EXAMS benchmark

* fix the linter issue and add more information to the README

* Update README.md

---------

Co-authored-by: Lintang Sutawika <[email protected]>
* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <[email protected]>
* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <[email protected]>
* Link to vllm integration

* add pip install .[vllm] cmd
* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true
* Differentiate _encode_pair setting for decoder and enc-dec models (see the sketch after this commit group)

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <[email protected]>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
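To make the encoding change concrete, here is a minimal sketch of the decoder-only versus encoder-decoder split; the function signature and the `tok_encode` parameter are illustrative, not the harness's exact API:

```python
# Hedged sketch: decoder-only models tokenize context + continuation jointly
# and slice, so tokens that merge across the boundary are preserved;
# encoder-decoder models can encode the two pieces independently.
def encode_pair(context, continuation, tok_encode, is_encoder_decoder):
    if is_encoder_decoder:
        return tok_encode(context), tok_encode(continuation)
    # Move trailing whitespace onto the continuation so the split point is stable.
    n_spaces = len(context) - len(context.rstrip())
    if n_spaces > 0:
        continuation = context[-n_spaces:] + continuation
        context = context[:-n_spaces]
    whole_enc = tok_encode(context + continuation)
    context_enc = tok_encode(context)
    return context_enc, whole_enc[len(context_enc):]
```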
* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key
* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <[email protected]>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
* make vllm use prefix_token_id; have prefix_token_id be an optional method to define (a sketch follows below)

* custom_prefix_token_id wasn't set if not passed
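A minimal sketch of the `prefix_token_id` idea from the commits above, assuming a HuggingFace-style tokenizer; everything other than the `prefix_token_id` name is a guess at the shape, not the exact harness code:

```python
class PrefixTokenSketch:
    """Hedged sketch: prefer an explicit custom prefix token, then BOS, then EOS."""

    def __init__(self, tokenizer, custom_prefix_token_id=None):
        self.tokenizer = tokenizer
        self.custom_prefix_token_id = custom_prefix_token_id

    @property
    def prefix_token_id(self):
        if self.custom_prefix_token_id is not None:
            return self.custom_prefix_token_id
        if self.tokenizer.bos_token_id is not None:
            return self.tokenizer.bos_token_id
        return self.tokenizer.eos_token_id
```

Loglikelihood scoring would then prepend `prefix_token_id` when the context is empty, rather than hard-coding EOS.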
* Add task ACLUE

* fix minor bug

* fix code style

* fix code style
* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit
* peft Version Assertion (a sketch follows below)

* fix the linter issue
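The assertion itself is not shown in this thread; a hedged sketch of what such a check typically looks like (the `0.2.0` bound is a placeholder, not the commit's actual requirement):

```python
from importlib.metadata import version
from packaging.version import parse

REQUIRED_PEFT = "0.2.0"  # placeholder bound; see the actual commit for the real one
installed = version("peft")
assert parse(installed) >= parse(REQUIRED_PEFT), (
    f"peft>={REQUIRED_PEFT} is required, but {installed} is installed"
)
```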
* fix on --task list

* add fixes to tokenization

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <[email protected]>
…erAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README
* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit
…eutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but it cannot be specified on the CLI, i.e. `lm_eval --model openai-completions --batch_size 16 ...` fails, because of a simple missing str->int conversion (a sketch of the fix follows the traceback).

This is confirmed by my usage and the stack trace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
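The fix this trace points at is a plain coercion before the batching loop; a hedged sketch (the helper name is hypothetical):

```python
def coerce_batch_size(batch_size):
    """Coerce a CLI-provided batch size such as "16" to int.

    Non-numeric values like "auto" pass through unchanged. This mirrors the
    missing str->int conversion described above, not the exact merged patch.
    """
    if isinstance(batch_size, str) and batch_size.isdigit():
        return int(batch_size)
    return batch_size
```

With the value coerced, `len(ret) >= size` in `sameuntil_chunks` compares two ints and the `TypeError` disappears.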
haileyschoelkopf and others added 25 commits June 7, 2024 13:13
* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml
* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`
* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils
* Update README.md

* Delete lm_eval/tasks/ammlu directory
Fix bug where `self.max_tokens` was not set
* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <[email protected]>
…#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <[email protected]>
…leutherAI#1956)

* fix: add filter to os.walk to ignore '.ipynb_checkpoints' (a sketch follows below)

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <[email protected]>
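A minimal sketch of the kind of `os.walk` filter described above (function and variable names are illustrative):

```python
import os

def iter_task_yamls(task_dir):
    """Yield YAML config paths, skipping '.ipynb_checkpoints' directories."""
    for root, dirs, files in os.walk(task_dir):
        # Pruning dirs in place tells os.walk not to descend into them.
        dirs[:] = [d for d in dirs if d != ".ipynb_checkpoints"]
        for name in files:
            if name.endswith(".yaml"):
                yield os.path.join(root, name)
```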
* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <[email protected]>
* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <[email protected]>
Co-authored-by: Zafir Stojanovski <[email protected]>
Co-authored-by: Lintang Sutawika <[email protected]>
* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <[email protected]>
* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`
* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <[email protected]>
@SpirinEgor
Member Author

SpirinEgor commented Jun 24, 2024

Check command:

```
accelerate launch -m lm_eval --model hf --batch_size auto \
    --tasks winogrande,arc_challenge,hellaswag,mmlu,gsm8k,truthfulqa_mc2 \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B
```

Results compared to OpenLLM Leaderboard:

| Task | OpenLLM | This update | Main version |
|------|---------|-------------|--------------|
| ARC | 60.24 | 58.70 | 58.70 |
| GSM8K | 45.19 | 49.36 (flexible-extract), 48.60 (strict-match) | 49.36 (flexible-extract), 48.60 (strict-match) |
| Hellaswag | 82.23 | 81.94 | 81.94 |
| MMLU | 66.7 | 65.19 | 65.19 |
| TruthfulQA | 42.93 | 43.91 | 43.91 |
| Winogrande | 78.45 | 76.72 | 76.72 |

The results are quite comparable; the differences are probably due to a different batch size (8 vs. 64) and some updates in tasks/prompts (the OpenLLM Leaderboard uses the version from June 23).

@Mogreine Mogreine self-requested a review June 25, 2024 10:09
@Mogreine Mogreine merged commit 7a89464 into main Jun 25, 2024
1 of 6 checks passed
@Mogreine Mogreine deleted the sync_with_upstream branch June 25, 2024 10:10