forked from EleutherAI/lm-evaluation-harness
Sync with upstream #9
Merged
Conversation
* add Arabic EXAMS benchmark
* fix the linter issue and add more information to the README
* Update README.md
Co-authored-by: Lintang Sutawika <[email protected]>

* add agieval
* fix typo
* add cloze / math exactmatch agieval tasks, rename
* update exact-match agieval tasks, allow for multiple-correct answers
* add more detail to readme
* don't parse_math_answer twice
Co-authored-by: Alex Bäuerle <[email protected]>

* add manual tqdm disabling management
* add typing to all new args
* apply pre-commit changes
Co-authored-by: haileyschoelkopf <[email protected]>

* Link to vllm integration
* add pip install .[vllm] cmd

* New tests for CLI args
* fix spacing
* change tests for parsing
* add tests, fix parser
* remove defaults for store_true

* Differentiate _encode_pair setting for decoder and enc-dec models
* tok_decode to not skip special tokens so that eos doesn't become an empty string
* Update model.py
* Update model.py
* Update huggingface.py
* Update lm_eval/models/huggingface.py (Co-authored-by: Hailey Schoelkopf <[email protected]>)
* Update model.py
Co-authored-by: Hailey Schoelkopf <[email protected]>

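For context on the `_encode_pair` change above, here is a minimal sketch of the decoder-only vs. encoder-decoder split it describes. This is an illustration under assumed names (`HFLMSketch`, `tok_encode`, `AUTO_MODEL_CLASS`), not the harness's exact code:

```python
from typing import List, Tuple

import transformers


class HFLMSketch:
    """Illustrative stub of a HF model wrapper; only what the sketch needs."""

    AUTO_MODEL_CLASS = transformers.AutoModelForCausalLM

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def tok_encode(self, string: str) -> List[int]:
        return self.tokenizer.encode(string, add_special_tokens=False)

    def _encode_pair(
        self, context: str, continuation: str
    ) -> Tuple[List[int], List[int]]:
        if self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
            # enc-dec: context feeds the encoder and continuation the
            # decoder, so the two halves are tokenized independently
            context_enc = self.tok_encode(context)
            continuation_enc = self.tok_encode(continuation)
        else:
            # decoder-only: tokenize the concatenation, then slice at the
            # boundary implied by tokenizing the context alone, so no
            # token straddles the context/continuation split
            whole_enc = self.tok_encode(context + continuation)
            context_enc = self.tok_encode(context)
            continuation_enc = whole_enc[len(context_enc):]
        return context_enc, continuation_enc
```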
* Update interface.md
* fix: make caching reqs always work with accelerate launch
* remove stale task migration checklist
* remove deprecation warnings
* make informative TypeErrors for get_task_dict
* bump version metadata
* fix num_fewshot printing bug
* add fewshot value to cache key

* Fix eval_logger import for mmlu/_generate_configs.py
* linter
Co-authored-by: Hailey Schoelkopf <[email protected]>

* use BOS token in loglikelihood
* improve comments
* add model arg
* log prefix token id
* log prefix token id
* Update lm_eval/api/model.py (Co-authored-by: Hailey Schoelkopf <[email protected]>)
* change name to prefix_token_id
Co-authored-by: Hailey Schoelkopf <[email protected]>

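The `prefix_token_id` change above amounts to a small fallback chain; a sketch under the assumption of a HF-style tokenizer, where `custom_prefix_token_id` is the user-supplied override mentioned in a later commit (names are assumptions, not the harness's exact code):

```python
class PrefixTokenSketch:
    """Sketch of the fallback logic the prefix_token_id change implies."""

    def __init__(self, tokenizer, custom_prefix_token_id=None):
        self.tokenizer = tokenizer
        self.custom_prefix_token_id = custom_prefix_token_id

    @property
    def prefix_token_id(self) -> int:
        # token prepended to every loglikelihood context: prefer a
        # user-supplied override, then BOS, then EOS for tokenizers
        # that define no BOS token
        if self.custom_prefix_token_id is not None:
            return self.custom_prefix_token_id
        if self.tokenizer.bos_token_id is not None:
            return self.tokenizer.bos_token_id
        return self.tokenizer.eos_token_id
```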
… (EleutherAI#1601) This reverts commit b7923a8.
* make vllm use prefix_token_id; have prefix_token_id be an optional method to define
* custom_prefix_token_id wasn't set if not passed

* Add task ACLUE
* fix minor bug
* fix code style
* fix code style

* add logging of model args
* nit
* Add warnings.
* nit
* add warning
* nit

* peft version assertion
* fix the linter issue

* fix on --task list
* add fixes to tokenization
* differentiate encoding for seq2seq and decoder
* return token setting
* format for pre-commit
* Seq2seq fix, pt2 (EleutherAI#1630)
* getting model class only when defined
* encode_pair handles None; add_special_tokens turned into a dict with a default value
Co-authored-by: achervyakov <[email protected]>

… (EleutherAI#1598)
* Integration of NeMo models into LM Evaluation Harness library
* rename nemo model as nemo_lm
* move nemo section in readme after hf section
* use self.eot_token_id in get_until()
* improve progress bar showing loglikelihood requests
* data replication or tensor/pipeline replication working fine within one node
* run pre-commit on modified files
* check whether dependencies are installed
* clarify usage of torchrun in README

* add basqueglue
* add eus_exams
* add eus_proficiency
* add eus_reading
* add eus_trivia
* run pre-commit

… (EleutherAI#1656) The OpenAI interface supports batch size as an argument to the completions API, but does not seem to support specifying it on the CLI, i.e. `lm_eval --model openai-completions --batch_size 16 ...`, because of a simple lack of str->int conversion. This is confirmed by my usage and the stacktrace from running `OPENAI_API_KEY=dummy lm_eval --model local-completions --tasks gsm8k --batch_size 16 --model_args model=nm-testing/zephyr-beta-7b-gptq-g128,tokenizer_backend=huggingface,base_url=http://localhost:8000/v1`:

```
Traceback (most recent call last):
  File "/home/michael/venv/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/michael/code/lm-evaluation-harness/lm_eval/__main__.py", line 341, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 251, in simple_evaluate
    results = evaluate(
  File "/home/michael/code/lm-evaluation-harness/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/evaluator.py", line 390, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 263, in generate_until
    list(sameuntil_chunks(re_ord.get_reordered(), self.batch_size)),
  File "/home/michael/code/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 251, in sameuntil_chunks
    if len(ret) >= size or x[1] != lastuntil:
TypeError: '>=' not supported between instances of 'int' and 'str'
```
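The fix implied by the traceback is a single cast where the model stores the CLI value; a hedged sketch, with the class stubbed and names simplified from `lm_eval/models/openai_completions.py`:

```python
from typing import Union


class OpenAICompletionsLMSketch:
    """Illustrative stub of the real class in openai_completions.py."""

    def __init__(self, batch_size: Union[str, int] = 1) -> None:
        # --batch_size reaches the constructor as the raw CLI string;
        # without this cast, sameuntil_chunks later compares it against
        # an int chunk length and raises the TypeError shown above
        self.batch_size = int(batch_size)  # e.g. "16" -> 16
```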
* Update README.md
* Update bec.yaml
* Update bhtc.yaml
* Update coref.yaml
* Update qnli.yaml
* Update vaxx.yaml
* Update wic.yaml

* sort metrics in output table
* update docstring in `consolidate_results`
* add tests for verifying consistency of table output
* update tests to account for floating point inconsistencies
* updated tests based on `pythia-14m`

* results filenames handling moved to utils
* zeno results handling fix
* tasks_for_model backward compatibility
* results files logic moved to tasks_for_model
* moved sanitize_model_name to utils

* Update README.md
* Delete lm_eval/tasks/ammlu directory

Fix bug where `self.max_tokens` was not set
* `samples` is newline delimited
* updated git and pre-commit
* appease pre-commit
* nit
* Revert back for now
* Revert for now
Co-authored-by: Lintang Sutawika <[email protected]>

…#1800)
* Update vllm_causallms.py
* adjust
Co-authored-by: lintangsutawika <[email protected]>

Co-authored-by: lintangsutawika <[email protected]>
… (EleutherAI#1956)
* fix: add filter to os.walk to ignore 'ipynb_checkpoints'
* Update __init__.py
* Update __init__.py
Co-authored-by: Lintang Sutawika <[email protected]>

Signed-off-by: changwangss <[email protected]>
* init paloma benchmark
* pre-process in utils function
* add `task_alias`
* updated task aliases
* Update paloma_dolma-v1_5.yaml
* Update paloma_twitterAAE_HELM_fixed.yaml
* Update paloma_dolma_100_programing_languages.yaml
Co-authored-by: Lintang Sutawika <[email protected]>

* init paloma benchmark
* pre-process in utils function
* add `task_alias`
* updated task aliases
* Update paloma_dolma-v1_5.yaml
* Update paloma_twitterAAE_HELM_fixed.yaml
* Update paloma_dolma_100_programing_languages.yaml
* update on names
* fix paloma template issue
Co-authored-by: Zafir Stojanovski <[email protected]>
Co-authored-by: Zafir Stojanovski <[email protected]>
Co-authored-by: Lintang Sutawika <[email protected]>

* log fewshot_as_multiturn in general tracker args
* Update evaluator.py
Co-authored-by: Lintang Sutawika <[email protected]>

* Added ArabicMMLU
* Rename `ammlu` to `arabicmmlu`

* add bertaqa tasks
* rename basquetrivia --> bertaqa; make template stub not .yaml
* add bertaqa entry to lm_eval/tasks/README.md
Co-authored-by: haileyschoelkopf <[email protected]>

Check command:
accelerate launch -m lm_eval --model hf --batch_size auto \
  --tasks winogrande,arc_challenge,hellaswag,mmlu,gsm8k,truthfulqa_mc2 \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B

Results compared to the OpenLLM Leaderboard: quite comparable; differences are probably due to a different batch size (8 vs 64) and some updates in tasks/prompts (OpenLLM uses the version from June '23).
Mogreine approved these changes on Jun 25, 2024.