
accelerate doesn't work with auto:(>1) #1743

Open
ozgurcelik opened this issue Apr 24, 2024 · 4 comments

@ozgurcelik

Hi. I realized that accelerate launch works perfectly when I set batch_size = "auto", but it gets stuck at the very end when I use batch_size = "auto:2". The problem persists whether I use evaluator.simple_evaluate or a terminal call to accelerate launch -m lm_eval. This is a problem since different tasks may have different optimal batch sizes.

Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:   0%|          | 0/1644 [00:00<?, ?it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Running loglikelihood requests:  40%|████      | 657/1644 [00:27<00:26, 37.06it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  41%|████      | 673/1644 [00:28<00:26, 37.21it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  42%|████      | 689/1644 [00:28<00:25, 37.21it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Running loglikelihood requests:  90%|█████████ | 1478/1644 [00:41<00:00, 209.33it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  97%|██████████| 1598/1644 [00:41<00:00, 234.70it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests: 100%|██████████| 1644/1644 [00:41<00:00, 39.17it/s]
Map: 100%|██████████| 5000/5000 [00:01<00:00, 3971.36 examples/s]
Map: 100%|██████████| 5000/5000 [00:01<00:00, 3402.27 examples/s]

This is how it looks when it gets stuck with auto:2. It unnecessarily tries to find the optimal batch size near the very end and never finishes the task.
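
For reference, the invocation I mean is of this shape (the model and task names below are placeholders for illustration, not my exact setup); swapping auto:2 for auto makes the hang go away:

    accelerate launch -m lm_eval \
        --model hf \
        --model_args pretrained=meta-llama/Llama-2-7b-hf \
        --tasks hellaswag \
        --batch_size auto:2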

@abzb1 commented Apr 24, 2024

You can just use auto; see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md
auto:2 means the batch size search is re-run twice 😅

@ozgurcelik (Author)

Correct me if I am wrong, but as the evaluation proceeds the samples may get shorter, so we might be able to fit more of them per batch later on. I was using auto:2 precisely for such cases, because I want the maximum batch size to be searched once again.
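
For the Python path, a minimal sketch of what I am calling (model and task here are placeholders, not my exact setup):

    from lm_eval import evaluator

    # hedged sketch: under accelerate launch this same script runs on every rank
    results = evaluator.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-2-7b-hf",
        tasks=["hellaswag"],
        batch_size="auto:2",  # re-run the batch size search a second time
    )

    # in distributed runs the aggregated results are only returned on the main process
    if results is not None:
        print(results["results"])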

@onebula commented Apr 30, 2024

I also ran into this problem. After all loglikelihood requests are finished, the process hangs with no further output while CPU/GPU utilization stays at 100%.
Mistral-7B-v0.1 on MMLU with auto:4 hits this problem, while hellaswag with auto:4 does not. Replacing auto:4 with auto solves it.
I believe there is a bug.
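
For what it's worth, my repro is roughly of this shape (exact flags may differ from what I ran):

    accelerate launch -m lm_eval \
        --model hf \
        --model_args pretrained=mistralai/Mistral-7B-v0.1 \
        --tasks mmlu \
        --batch_size auto:4

The same command with --batch_size auto (or with --tasks hellaswag) finishes normally.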

@haileyschoelkopf (Contributor)

Hi! I'll look into this -- I suspect the padding across ranks is slightly off somewhere, or else the batch sizes get out of sync.

@ozgurcelik -- do you have a sample command which exhibits this problem?
