
Sync with upstream #9

Merged
merged 121 commits into main from sync_with_upstream on Jun 25, 2024
Conversation

SpirinEgor
Member

No description provided.

haileyschoelkopf and others added 30 commits March 11, 2024 21:02
* add Arabic EXAMS benchmark

* fix the linter issue and add more information to the README

* Update README.md

---------

Co-authored-by: Lintang Sutawika <[email protected]>
* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <[email protected]>
* add manual tqdm disabling management

* add typing to all new args

* apply precommit changes

---------

Co-authored-by: haileyschoelkopf <[email protected]>
* Link to vllm integration

* add pip install .[vllm] cmd
* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true
* Differentiate _encode_pair setting for decoder and enc-dec models (see the sketch after this commit group)

* tok_decode to not skip special tokens so that eos doesn't become an empty string

* Update model.py

* Update model.py

* Update huggingface.py

* Update lm_eval/models/huggingface.py

Co-authored-by: Hailey Schoelkopf <[email protected]>

* Update model.py

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
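To make the encoding change concrete, here is a minimal sketch of the decoder-only versus encoder-decoder split; the function signature and the `tok_encode` parameter are illustrative, not the harness's exact API:

```python
# Hedged sketch: decoder-only models tokenize context + continuation jointly
# and slice, so tokens that merge across the boundary are preserved;
# encoder-decoder models can encode the two pieces independently.
def encode_pair(context, continuation, tok_encode, is_encoder_decoder):
    if is_encoder_decoder:
        return tok_encode(context), tok_encode(continuation)
    # Move trailing whitespace onto the continuation so the split point is stable.
    n_spaces = len(context) - len(context.rstrip())
    if n_spaces > 0:
        continuation = context[-n_spaces:] + continuation
        context = context[:-n_spaces]
    whole_enc = tok_encode(context + continuation)
    context_enc = tok_encode(context)
    return context_enc, whole_enc[len(context_enc):]
```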
* Update interface.md

* fix: make caching reqs always work with accelerate launch

* remove stale task migration checklist

* remove deprecation warnings

* make informative TypeErrors for get_task_dict

* bump version metadata

* fix num_fewshot printing bug

* add fewshot value to cache key
* Fix eval_logger import for mmlu/_generate_configs.py

* linter

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
* use BOS token in loglikelihood

* improve comments

* add model arg

* log prefix token id

* log prefix token id

* Update lm_eval/api/model.py

Co-authored-by: Hailey Schoelkopf <[email protected]>

* change name to prefix_token_id

---------

Co-authored-by: Hailey Schoelkopf <[email protected]>
* make vllm use prefix_token_id; have prefix_token_id be an optional method to define (a sketch follows below)

* custom_prefix_token_id wasn't set if not passed
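A minimal sketch of the `prefix_token_id` idea from the commits above, assuming a HuggingFace-style tokenizer; everything other than the `prefix_token_id` name is a guess at the shape, not the exact harness code:

```python
class PrefixTokenSketch:
    """Hedged sketch: prefer an explicit custom prefix token, then BOS, then EOS."""

    def __init__(self, tokenizer, custom_prefix_token_id=None):
        self.tokenizer = tokenizer
        self.custom_prefix_token_id = custom_prefix_token_id

    @property
    def prefix_token_id(self):
        if self.custom_prefix_token_id is not None:
            return self.custom_prefix_token_id
        if self.tokenizer.bos_token_id is not None:
            return self.tokenizer.bos_token_id
        return self.tokenizer.eos_token_id
```

Loglikelihood scoring would then prepend `prefix_token_id` when the context is empty, rather than hard-coding EOS.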
* Add task ACLUE

* fix minor bug

* fix code style

* fix code style
* add logging of model args

* nit

* Add warnings.

* nit

* add warning

* nit
* peft Version Assertion (a sketch follows below)

* fix the linter issue
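The assertion itself is not shown in this thread; a hedged sketch of what such a check typically looks like (the `0.2.0` bound is a placeholder, not the commit's actual requirement):

```python
from importlib.metadata import version
from packaging.version import parse

REQUIRED_PEFT = "0.2.0"  # placeholder bound; see the actual commit for the real one
installed = version("peft")
assert parse(installed) >= parse(REQUIRED_PEFT), (
    f"peft>={REQUIRED_PEFT} is required, but {installed} is installed"
)
```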
* fix on --task list

* add fixes to tokenization

* differentiate encoding for seq2seq and decoder

* return token setting

* format for pre-commit

* Seq2seq fix, pt2 (EleutherAI#1630)

* getting model class only when defined

* encode_pair handles None, add_special_tokens turned into dict with default value

---------

Co-authored-by: achervyakov <[email protected]>
…erAI#1598)

* Integration of NeMo models into LM Evaluation Harness library

* rename nemo model as nemo_lm

* move nemo section in readme after hf section

* use self.eot_token_id in get_until()

* improve progress bar showing loglikelihood requests

* data replication or tensor/pipeline replication working fine within one node

* run pre-commit on modified files

* check whether dependencies are installed

* clarify usage of torchrun in README
* add basqueglue

* add eus_exams

* add eus_proficiency

* add eus_reading

* add eus_trivia

* run pre-commit
…eutherAI#1656)

The OpenAI interface supports batch size as an argument to the completions API, but it cannot be specified on the CLI, i.e. `lm_eval --model openai-completions --batch_size 16 ...` fails, because of a simple missing str->int conversion (a sketch of the fix follows the traceback).

This is confirmed by my usage and the stack trace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:
```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
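The fix this trace points at is a plain coercion before the batching loop; a hedged sketch (the helper name is hypothetical):

```python
def coerce_batch_size(batch_size):
    """Coerce a CLI-provided batch size such as "16" to int.

    Non-numeric values like "auto" pass through unchanged. This mirrors the
    missing str->int conversion described above, not the exact merged patch.
    """
    if isinstance(batch_size, str) and batch_size.isdigit():
        return int(batch_size)
    return batch_size
```

With the value coerced, `len(ret) >= size` in `sameuntil_chunks` compares two ints and the `TypeError` disappears.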
haileyschoelkopf and others added 25 commits June 7, 2024 13:13
* Update README.md

* Update bec.yaml

* Update bhtc.yaml

* Update coref.yaml

* Update qnli.yaml

* Update vaxx.yaml

* Update wic.yaml
* sort metrics in output table

* update docstring in `consolidate_results`

* add tests for verifying consistency of table output

* update tests to account for floating point inconsistencies

* updated tests based on `pythia-14m`
* results filenames handling moved to utils

* zeno results handling fix

* tasks_for_model backward compatibility

* results files logic moved to tasks_for_model

* moved sanitize_model_name to utils
* Update README.md

* Delete lm_eval/tasks/ammlu directory
Fix bug where `self.max_tokens` was not set
* `samples` is newline delimited

* updated git and pre-commit

* appease pre-commit

* nit

* Revert back for now

* Revert for now

---------

Co-authored-by: Lintang Sutawika <[email protected]>
…#1800)

* Update vllm_causallms.py

* adjust

---------

Co-authored-by: lintangsutawika <[email protected]>
…leutherAI#1956)

* fix: add filter to os.walk to ignore '.ipynb_checkpoints' (a sketch follows below)

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Lintang Sutawika <[email protected]>
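A minimal sketch of the kind of `os.walk` filter described above (function and variable names are illustrative):

```python
import os

def iter_task_yamls(task_dir):
    """Yield YAML config paths, skipping '.ipynb_checkpoints' directories."""
    for root, dirs, files in os.walk(task_dir):
        # Pruning dirs in place tells os.walk not to descend into them.
        dirs[:] = [d for d in dirs if d != ".ipynb_checkpoints"]
        for name in files:
            if name.endswith(".yaml"):
                yield os.path.join(root, name)
```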
* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

---------

Co-authored-by: Lintang Sutawika <[email protected]>
* init paloma benchmark

* pre-process in utils function

* add `task_alias`

* updated task aliases

* Update paloma_dolma-v1_5.yaml

* Update paloma_twitterAAE_HELM_fixed.yaml

* Update paloma_dolma_100_programing_languages.yaml

* update on names

* fix paloma template issue

---------

Co-authored-by: Zafir Stojanovski <[email protected]>
Co-authored-by: Zafir Stojanovski <[email protected]>
Co-authored-by: Lintang Sutawika <[email protected]>
* log fewshot_as_multiturn in general tracker args

* Update evaluator.py

---------

Co-authored-by: Lintang Sutawika <[email protected]>
* Added ArabicMMLU

* Rename `ammlu` to `arabicmmlu`
* add bertaqa tasks

* rename basquetrivia-->bertaqa ; make template stub not .yaml

* add bertaqa entry to lm_eval/tasks/README.md

---------

Co-authored-by: haileyschoelkopf <[email protected]>
@SpirinEgor
Member Author

SpirinEgor commented Jun 24, 2024

Check command:

```
accelerate launch -m lm_eval --model hf --batch_size auto \
    --tasks winogrande,arc_challenge,hellaswag,mmlu,gsm8k,truthfulqa_mc2 \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B
```

Results compared to OpenLLM Leaderboard:

| Task | OpenLLM | This update | Main version |
|------|---------|-------------|--------------|
| ARC | 60.24 | 58.70 | 58.70 |
| GSM8K | 45.19 | 49.36 (flexible-extract), 48.60 (strict-match) | 49.36 (flexible-extract), 48.60 (strict-match) |
| Hellaswag | 82.23 | 81.94 | 81.94 |
| MMLU | 66.7 | 65.19 | 65.19 |
| TruthfulQA | 42.93 | 43.91 | 43.91 |
| Winogrande | 78.45 | 76.72 | 76.72 |

The results are quite comparable; the differences are probably due to a different batch size (8 vs. 64) and some updates in tasks/prompts (the OpenLLM Leaderboard uses the version from June 23).

@Mogreine Mogreine self-requested a review June 25, 2024 10:09
@Mogreine Mogreine merged commit 7a89464 into main Jun 25, 2024
1 of 6 checks passed
@Mogreine Mogreine deleted the sync_with_upstream branch June 25, 2024 10:10