
Running lm-evaluation-harness on 1 example but progress bar runs to 4 #1278

Closed

surya-narayanan opened this issue Jan 13, 2024 · 1 comment

@surya-narayanan

Hi folks,

Thanks for a great tool. I want to run the evaluation harness on just one example, but the progress bar runs to 4, which confuses me. Can you help me understand why, or point me to the relevant place in the documentation?

The command and its output are below:

paperspace@psgwzz6bpkub:~$ lm_eval --model hf     --model_args pretrained=gpt2     --tasks sciq    --device cuda:0     --batch_size 1 --limit 1
2024-01-13:14:13:59,717 INFO     [utils.py:148] Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-01-13:14:13:59,717 INFO     [utils.py:160] NumExpr defaulting to 8 threads.
2024-01-13:14:13:59,953 INFO     [config.py:58] PyTorch version 2.0.1+cu117 available.
2024-01-13:14:13:59,954 INFO     [config.py:95] TensorFlow version 2.9.2 available.
2024-01-13:14:13:59,956 INFO     [config.py:108] JAX version 0.4.8 available.
2024-01-13:14:14:08,172 INFO     [__main__.py:156] Verbosity set to INFO
2024-01-13:14:14:12,073 WARNING  [__init__.py:178] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-01-13:14:14:15,790 WARNING  [__init__.py:178] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-01-13:14:14:15,790 WARNING  [__main__.py:162]  --limit SHOULD ONLY BE USED FOR TESTING.REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2024-01-13:14:14:15,791 INFO     [__main__.py:229] Selected Tasks: ['sciq']
2024-01-13:14:14:15,804 WARNING  [logging.py:61] Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-01-13:14:14:15,804 INFO     [huggingface.py:146] Using device 'cuda:0'
2024-01-13:14:14:21,039 INFO     [task.py:337] Building contexts for task on rank 0...
2024-01-13:14:14:21,041 INFO     [evaluator.py:314] Running loglikelihood requests
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.21it/s]
@surya-narayanan (Author)

I think I understand now: because each sciq question has four answer options, the harness expands every question + options set into one (question, option) pair per option and runs a loglikelihood request for each pair. The number of requests is therefore 4× the number of questions, so `--limit 1` still produces 4 requests.
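For anyone else who lands here, here is a minimal sketch of the idea. The doc structure and `build_requests` helper are hypothetical illustrations, not the harness's actual internals:

```python
# Sketch: why one multiple-choice document yields several loglikelihood
# requests. Hypothetical names, not lm-evaluation-harness internals.

doc = {
    "question": "What force opposes motion between two touching surfaces?",
    "choices": ["gravity", "friction", "magnetism", "inertia"],  # 4 options
}

def build_requests(doc):
    # One (context, continuation) pair per answer option; the model is
    # asked for the loglikelihood of each continuation given the context.
    return [(doc["question"], " " + choice) for choice in doc["choices"]]

requests = build_requests(doc)
print(len(requests))  # 4 -> matches the 4/4 progress bar under --limit 1
```

The option whose continuation gets the highest loglikelihood is taken as the model's answer, which is why the progress bar counts requests rather than documents.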
