
accelerate doesn't work with auto:(>1) #1743

Open
ozgurcelik opened this issue Apr 24, 2024 · 4 comments

@ozgurcelik

Hi. I realized that accelerate launch works perfectly when I set batch_size = "auto", but it gets stuck at the very end when I use batch_size = "auto:2". The problem persists whether I use evaluator.simple_evaluate or a terminal call to accelerate launch -m lm_eval. This is a problem since different tasks may have different optimal batch sizes.

Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:   0%|          | 0/1644 [00:00<?, ?it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Running loglikelihood requests:  40%|████      | 657/1644 [00:27<00:26, 37.06it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  41%|████      | 673/1644 [00:28<00:26, 37.21it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  42%|████      | 689/1644 [00:28<00:25, 37.21it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Determined largest batch size: 16
Running loglikelihood requests:  90%|█████████ | 1478/1644 [00:41<00:00, 209.33it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests:  97%|██████████| 1598/1644 [00:41<00:00, 234.70it/s]
Passed argument batch_size = auto:2.0. Detecting largest batch size
Running loglikelihood requests: 100%|██████████| 1644/1644 [00:41<00:00, 39.17it/s]
Map: 100%|██████████| 5000/5000 [00:01<00:00, 3971.36 examples/s]
Map: 100%|██████████| 5000/5000 [00:01<00:00, 3402.27 examples/s]

This is how it looks when it gets stuck with auto:2. It unnecessarily tries to find the optimal batch size near the very end and never finishes the task.
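
For reference, the invocation I mean is of this shape (the model and task names below are placeholders for illustration, not my exact setup); swapping auto:2 for auto makes the hang go away:

    accelerate launch -m lm_eval \
        --model hf \
        --model_args pretrained=meta-llama/Llama-2-7b-hf \
        --tasks hellaswag \
        --batch_size auto:2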

@abzb1 commented Apr 24, 2024

You can just use auto; see https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md
auto:2 means the batch size search is re-run twice 😅

@ozgurcelik (Author)

Correct me if I am wrong, but as the evaluation proceeds the samples may get shorter, so we might be able to fit more of them per batch later on. I was using auto:2 precisely for such cases, because I want the maximum batch size to be searched once again.
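
For the Python path, a minimal sketch of what I am calling (model and task here are placeholders, not my exact setup):

    from lm_eval import evaluator

    # hedged sketch: under accelerate launch this same script runs on every rank
    results = evaluator.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-2-7b-hf",
        tasks=["hellaswag"],
        batch_size="auto:2",  # re-run the batch size search a second time
    )

    # in distributed runs the aggregated results are only returned on the main process
    if results is not None:
        print(results["results"])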

@onebula commented Apr 30, 2024

I also ran into this problem. After all loglikelihood requests are finished, the process hangs with no further output while CPU/GPU utilization stays at 100%.
Mistral-7B-v0.1 on MMLU with auto:4 hits this problem, while hellaswag with auto:4 does not. Replacing auto:4 with auto solves it.
I believe there is a bug.
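
For what it's worth, my repro is roughly of this shape (exact flags may differ from what I ran):

    accelerate launch -m lm_eval \
        --model hf \
        --model_args pretrained=mistralai/Mistral-7B-v0.1 \
        --tasks mmlu \
        --batch_size auto:4

The same command with --batch_size auto (or with --tasks hellaswag) finishes normally.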

@haileyschoelkopf (Contributor)

Hi! I'll look into this -- I suspect the padding across ranks is slightly off somewhere, or else the batch sizes get out of sync.

@ozgurcelik -- do you have a sample command which exhibits this problem?
