Insights: EleutherAI/lm-evaluation-harness
Overview
1 Release published by 1 person
-
v0.4.4
published
Sep 5, 2024
23 Pull requests merged by 10 people
-
openai: better error messages; fix greedy matching
#2327 merged
Sep 26, 2024 -
add mmlu readme
#2282 merged
Sep 26, 2024 -
Added TurkishMMLU to LM Evaluation Harness
#2283 merged
Sep 26, 2024 -
mmlu-pro: add newlines to task descriptions (not leaderboard)
#2334 merged
Sep 26, 2024 -
change glianorex to test split
#2332 merged
Sep 26, 2024 -
change group to tags in task `eus_exams` task configs
#2320 merged
Sep 26, 2024 -
Treat tags in python tasks the same as yaml tasks
#2288 merged
Sep 26, 2024 -
fix writeout script
#2350 merged
Sep 26, 2024 -
squad v2: load metric with `evaluate`
#2351 merged
Sep 26, 2024 -
Add a note for missing dependencies
#2336 merged
Sep 24, 2024 -
Fixed dummy model
#2339 merged
Sep 24, 2024 -
Update neuron backend
#2314 merged
Sep 18, 2024 -
remove comma
#2315 merged
Sep 17, 2024 -
Update README.md
#2297 merged
Sep 17, 2024 -
Multimodal prototyping
#2243 merged
Sep 13, 2024 -
Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version)
#2232 merged
Sep 10, 2024 -
Bump version to v0.4.4 ; Fixes to TMMLUplus
#2280 merged
Sep 5, 2024 -
Chat Template fix (cont. #2235)
#2269 merged
Sep 4, 2024 -
hotfix #2262
#2264 merged
Aug 30, 2024 -
API: fix maxlen; vllm: prefix_token_id bug
#2262 merged
Aug 30, 2024 -
Fix `loglikelihood_rolling` caching (#1821)
#2187 merged
Aug 28, 2024 -
Update NLTK version in `*ifeval` tasks (#2210)
#2259 merged
Aug 28, 2024 -
[Draft] More descriptive `simple_evaluate()` LM TypeError
#2258 merged
Aug 28, 2024
16 Pull requests opened by 11 people
-
Ifeval: Dowload `punkt_tab` on rank 0
#2267 opened
Aug 30, 2024 -
Add Yue-Benchmark and update tasks description
#2270 opened
Aug 31, 2024 -
Nvidia TensorRT-LLM
#2271 opened
Sep 1, 2024 -
Gen Prefix
#2274 opened
Sep 2, 2024 -
fix some bugs of mmlu
#2299 opened
Sep 14, 2024 -
add new truncation strategy
#2300 opened
Sep 15, 2024 -
Fix missing key in custom task loading.
#2304 opened
Sep 16, 2024 -
avoid timeout errors with high concurrency in api_model
#2307 opened
Sep 16, 2024 -
Scrolls branch
#2309 opened
Sep 16, 2024 -
mmlu translated professionally by OpenAI
#2312 opened
Sep 17, 2024 -
Mathvista
#2321 opened
Sep 18, 2024 -
Fix float limit override
#2325 opened
Sep 19, 2024 -
Support pipeline parallel with OpenVINO models
#2349 opened
Sep 25, 2024 -
HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS`
#2353 opened
Sep 25, 2024 -
Add metabench task to LM Evaluation Harness
#2357 opened
Sep 26, 2024 -
fix `cost_estimate` script
#2359 opened
Sep 26, 2024
42 Issues closed by 13 people
-
Support for Using Multiple Choice Datasets with GPT-4o Model via OpenAI API
#2326 closed
Sep 26, 2024 -
Evaluation of MMLU tasks using the OpenAI API
#2318 closed
Sep 26, 2024 -
Issue with openai completions API - related to logprobs
#2287 closed
Sep 26, 2024 -
`glianorex_en` task does not work
#2329 closed
Sep 26, 2024 -
Tasks of type `python_task` are not listed in `lm-eval --tasks list`
#2268 closed
Sep 26, 2024 -
--tasks mmlu
#2355 closed
Sep 26, 2024 -
AttributeError: 'dict' object has no attribute 'has_test_docs'
#2342 closed
Sep 26, 2024 -
squadv2 task occurred "AttributeError: module 'datasets' has no attribute 'load_metric'"
#2348 closed
Sep 26, 2024 -
Question about IFEval on LeaderBoard
#2200 closed
Sep 24, 2024 -
Feature request: `4.4.0` Pypi release with `leaderboard`
#2195 closed
Sep 23, 2024 -
mmlu_pro fewshot_config
#2196 closed
Sep 23, 2024 -
Error of `continuation_logprobs_dicts` is `None` when running with `vllm` on multi-choice tasks
#2205 closed
Sep 23, 2024 -
Metrics that require probability scores (y_scores)
#2272 closed
Sep 23, 2024 -
the log is end,the gpu is not calculate,but is storing,the result is not getting,is it normal?
#2295 closed
Sep 23, 2024 -
Multi-node MMLU support ?
#2281 closed
Sep 23, 2024 -
External API - same results different models
#2284 closed
Sep 23, 2024 -
Using multi-GPU with accelerate is not working
#2292 closed
Sep 23, 2024 -
how to get lm_eval version 4.2
#2319 closed
Sep 18, 2024 -
Chat templates
#2308 closed
Sep 18, 2024 -
GSM8K Problem On Colab With Finetuned Phi3.5 mini model
#2316 closed
Sep 17, 2024 -
Comma breaks __repr__ for write-out
#2313 closed
Sep 17, 2024 -
What's going on with swde or squadv2 tasks ?
#2286 closed
Sep 12, 2024 -
Do the version of CMMLU and MMLU make any differences?
#2276 closed
Sep 10, 2024 -
May be parse LAST numbers in GSM8K "flexible-extract" filter?
#2278 closed
Sep 5, 2024 -
apply_chat_template got 'str' object is not callable
#2231 closed
Sep 4, 2024 -
Checkpointing Evaluation Results and Enable Resume of Evaluation
#2140 closed
Sep 3, 2024 -
fatal: not a git repository (or any parent up to mount point /kaggle)
#2263 closed
Sep 3, 2024 -
task_name=None issue when loading local dataset.
#2257 closed
Sep 3, 2024 -
How do I customize llama model architecture and run benchmark?
#2273 closed
Sep 2, 2024 -
Cannot evaluate on unitxt-related tasks such as unfair-tos, ledgar...
#2261 closed
Aug 30, 2024 -
local-completions error with dataset longer than max_length
#2253 closed
Aug 30, 2024 -
Why no results for closed-sourced models?
#2225 closed
Aug 29, 2024 -
API model: Evaluation fails when all samples are cached
#2141 closed
Aug 28, 2024 -
TypeError: 'NoneType' object is not iterable when using cache and loglikelihood_rolling
#1821 closed
Aug 28, 2024 -
nltk pickle
#2210 closed
Aug 28, 2024 -
Raising TypeError when using simple_evaluate()
#2254 closed
Aug 28, 2024 -
Missing Tasks in Leaderboard
#2204 closed
Aug 28, 2024 -
The results in the new and old versions differ from one another.
#2211 closed
Aug 28, 2024 -
What are `mmlu_continuation` and `mmlu_generative`?
#2255 closed
Aug 28, 2024
37 Issues opened by 33 people
-
Which filter value should be used among the accuracy test results?
#2362 opened
Sep 27, 2024 -
boolq trust remote code
#2361 opened
Sep 27, 2024 -
[multimodal] llava-1.5-7b-hf doesn't work on `mmmu_val`
#2360 opened
Sep 26, 2024 -
Add a test for `scripts/write_out.py` and other `scripts/` utils
#2356 opened
Sep 26, 2024 -
Evaluation of MMLU tasks using a fined tuned Gemma 2 model
#2354 opened
Sep 26, 2024 -
Setting limit_mm_per_prompt for vllm_vlm fails argument parser
#2352 opened
Sep 25, 2024 -
Unexpected space character
#2346 opened
Sep 25, 2024 -
tasks RACE only high not "middle"
#2345 opened
Sep 25, 2024 -
Reproduce QWen 2.5-14B-Instruct and LLaMa-3.1-8B-Instruct Results
#2344 opened
Sep 25, 2024 -
gpt2 evaluation
#2343 opened
Sep 24, 2024 -
Locally reproducible HF-Leaderboard evals
#2338 opened
Sep 24, 2024 -
Dynamical prompt with extremely promising results #RIPrompt
#2335 opened
Sep 23, 2024 -
Confusion over the model outputs
#2331 opened
Sep 23, 2024 -
Failed to add a new metric
#2330 opened
Sep 23, 2024 -
Hashing error when setting random seed for vllm model
#2328 opened
Sep 22, 2024 -
Bug in the float limit handling
#2324 opened
Sep 19, 2024 -
Error for AGIEval when using fewshot
#2323 opened
Sep 19, 2024 -
Which version to use
#2322 opened
Sep 19, 2024 -
Multiple generations (sequential) per question
#2317 opened
Sep 17, 2024 -
Running multiple processes on a shared outlines cache database
#2306 opened
Sep 16, 2024 -
New Task: `openai_mmmlu` professionaly translated by OpenAI as part of o1 release
#2305 opened
Sep 16, 2024 -
Missing key in dictionary when loading tasks.
#2303 opened
Sep 16, 2024 -
Configuring Azure OPENAI
#2302 opened
Sep 16, 2024 -
Fail to reproduce the perplexity of Llama-2 7B on wikitext
#2301 opened
Sep 15, 2024 -
Low GPU Utilization During Multi-GPU evaluation - Efficiency Optimization
#2296 opened
Sep 14, 2024 -
Worse evaluation performance with PEFT adaptors
#2294 opened
Sep 13, 2024 -
RuntimeError: CUDA error: device-side assert triggered
#2293 opened
Sep 12, 2024 -
Infer time by use library's external api is much longer than script
#2291 opened
Sep 11, 2024 -
Couldn't parse .yaml file for configuration
#2290 opened
Sep 11, 2024 -
A little typing issue
#2289 opened
Sep 10, 2024 -
Can we connect to Vertex AI model
#2285 opened
Sep 9, 2024 -
zero accuracy on `mmlu_generative`
#2279 opened
Sep 5, 2024 -
IFEval fails when multiple gpus are used (for DDP)
#2266 opened
Aug 30, 2024 -
Bug in Leaderboard IFEval Code
#2260 opened
Aug 29, 2024 -
Regarding metric chrf's implementation
#2256 opened
Aug 28, 2024
28 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Add new benchmark: Catalan bench
#2154 commented on
Sep 27, 2024 • 10 new comments -
Add new benchmark: Galician bench
#2155 commented on
Sep 27, 2024 • 8 new comments -
Add new benchmark: Basque bench
#2153 commented on
Sep 27, 2024 • 5 new comments -
Add new benchmark: Spanish bench
#2157 commented on
Sep 27, 2024 • 4 new comments -
Minor features
#2249 commented on
Sep 14, 2024 • 3 new comments -
Draft - Support ov models via genai
#1862 commented on
Sep 3, 2024 • 3 new comments -
[Draft] llm-as-judge
#2251 commented on
Sep 25, 2024 • 1 new comment -
Add new benchmark: Portuguese bench
#2156 commented on
Sep 27, 2024 • 1 new comment -
[rank1]: huggingface_hub.utils._errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url:
#2202 commented on
Aug 28, 2024 • 0 new comments -
TypeError: argument 'ids': 'NoneType' object cannot be converted to 'Sequence'
#2178 commented on
Aug 29, 2024 • 0 new comments -
Add KoCommonGEN v2 benchmark
#2208 commented on
Aug 28, 2024 • 0 new comments -
Evaluate Gemma with Chat Template
#2069 commented on
Sep 5, 2024 • 0 new comments -
Supporting Multimodality
#2014 commented on
Sep 5, 2024 • 0 new comments -
Allow Task objects to defer dataset download
#1558 commented on
Sep 9, 2024 • 0 new comments -
GPT2 eval in lambada_openai, acc only 0.325
#2159 commented on
Sep 9, 2024 • 0 new comments -
How to use Custom Prompt during Evaluation
#2131 commented on
Sep 12, 2024 • 0 new comments -
Medical specialities
#2113 commented on
Sep 18, 2024 • 0 new comments -
Chat template fix
#2058 commented on
Sep 11, 2024 • 0 new comments -
Fix partial caching of openai models
#1997 commented on
Aug 29, 2024 • 0 new comments -
Confusion matrix metric
#1921 commented on
Sep 8, 2024 • 0 new comments -
mlx Model (loglikelihood & generate_until)
#1902 commented on
Sep 9, 2024 • 0 new comments -
Low results on TriviaQA
#1292 commented on
Sep 17, 2024 • 0 new comments -
add context-based requests processing
#1571 commented on
Sep 6, 2024 • 0 new comments -
HellaSwag with UnicodeDecodeError
#1757 commented on
Sep 26, 2024 • 0 new comments -
Add long context evaluation benchmarks such as LongBench and LEval.
#2180 commented on
Sep 23, 2024 • 0 new comments -
eval gsm8k from local dataset folder with the bug info "ValueError: BuilderConfig 'main' not found."
#1829 commented on
Sep 23, 2024 • 0 new comments -
The response is too short to extract answer on GPQA. What should I set to extend it?
#2081 commented on
Sep 18, 2024 • 0 new comments -
Inconsistent evaluation results with Chat Template
#1841 commented on
Sep 17, 2024 • 0 new comments