
Running lm-evaluation-harness on 1 example but progress bar runs to 4 #1278

Closed

surya-narayanan opened this issue Jan 13, 2024 · 1 comment

@surya-narayanan

Hi folks,

Thanks for a great tool. I want to run the evaluation harness on just one example, but the progress bar runs to 4, which confuses me. Can you help me understand why, or point me to the relevant place in the documentation?

The command and its output are below:

paperspace@psgwzz6bpkub:~$ lm_eval --model hf     --model_args pretrained=gpt2     --tasks sciq    --device cuda:0     --batch_size 1 --limit 1
2024-01-13:14:13:59,717 INFO     [utils.py:148] Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-01-13:14:13:59,717 INFO     [utils.py:160] NumExpr defaulting to 8 threads.
2024-01-13:14:13:59,953 INFO     [config.py:58] PyTorch version 2.0.1+cu117 available.
2024-01-13:14:13:59,954 INFO     [config.py:95] TensorFlow version 2.9.2 available.
2024-01-13:14:13:59,956 INFO     [config.py:108] JAX version 0.4.8 available.
2024-01-13:14:14:08,172 INFO     [__main__.py:156] Verbosity set to INFO
2024-01-13:14:14:12,073 WARNING  [__init__.py:178] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-01-13:14:14:15,790 WARNING  [__init__.py:178] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-01-13:14:14:15,790 WARNING  [__main__.py:162]  --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2024-01-13:14:14:15,791 INFO     [__main__.py:229] Selected Tasks: ['sciq']
2024-01-13:14:14:15,804 WARNING  [logging.py:61] Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-01-13:14:14:15,804 INFO     [huggingface.py:146] Using device 'cuda:0'
2024-01-13:14:14:21,039 INFO     [task.py:337] Building contexts for task on rank 0...
2024-01-13:14:14:21,041 INFO     [evaluator.py:314] Running loglikelihood requests
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.21it/s]
@surya-narayanan (Author)

I think I understand now: because each sciq question has four answer options, the harness expands every question + options set into one (question, option) pair per option and runs a loglikelihood request for each pair. The number of requests is therefore 4× the number of questions, so `--limit 1` still produces 4 requests.
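For anyone else who lands here, here is a minimal sketch of the idea. The doc structure and `build_requests` helper are hypothetical illustrations, not the harness's actual internals:

```python
# Sketch: why one multiple-choice document yields several loglikelihood
# requests. Hypothetical names, not lm-evaluation-harness internals.

doc = {
    "question": "What force opposes motion between two touching surfaces?",
    "choices": ["gravity", "friction", "magnetism", "inertia"],  # 4 options
}

def build_requests(doc):
    # One (context, continuation) pair per answer option; the model is
    # asked for the loglikelihood of each continuation given the context.
    return [(doc["question"], " " + choice) for choice in doc["choices"]]

requests = build_requests(doc)
print(len(requests))  # 4 -> matches the 4/4 progress bar under --limit 1
```

The option whose continuation gets the highest loglikelihood is taken as the model's answer, which is why the progress bar counts requests rather than documents.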
