
When using Accelerate for data-parallel inference, different numbers of GPUs produce different results #1719

Open
s1ghhh opened this issue Apr 18, 2024 · 4 comments


s1ghhh commented Apr 18, 2024

Hi @haileyschoelkopf, thank you for your awesome open-source work. We have been running evaluations with lm-eval and noticed that, when using accelerate for data-parallel inference, the number of GPUs used changes the results, and the deviation between these results is greater than the stderr (about 0.012x).

We have conducted extensive evaluations on Winogrande using the same settings as the Open LLM Leaderboard, with num_fewshot=5 and batch_size=1.

Here are the results we obtained:

| # of GPUs | acc    |
|-----------|--------|
| 1         | 0.7443 |
| 2         | 0.7419 |
| 3         | 0.7411 |
| 4         | 0.7269 |
| 5         | 0.7498 |
| 6         | 0.7530 |
| 7         | 0.7498 |
| 8         | 0.7443 |

Script for 5-shot inference with 1 GPU:

CUDA_VISIBLE_DEVICES=0 accelerate launch -m lm_eval --model hf \
  --model_args pretrained=allenai/tulu-2-dpo-7b,trust_remote_code=True,dtype="bfloat16" \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size 1

Script for 5-shot inference with 4 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m lm_eval --model hf \
  --model_args pretrained=allenai/tulu-2-dpo-7b,trust_remote_code=True,dtype="bfloat16" \
  --tasks winogrande \
  --num_fewshot 5 \
  --batch_size 1

We believe this might be due to the few-shot sampling (num_fewshot). When we set num_fewshot=0, we obtain a stable result: 0.6993.

Script for 0-shot inference with 1 GPU:

CUDA_VISIBLE_DEVICES=0 accelerate launch -m lm_eval --model hf \
  --model_args pretrained=allenai/tulu-2-dpo-7b,trust_remote_code=True,dtype="bfloat16" \
  --tasks winogrande \
  --num_fewshot 0 \
  --batch_size 1

Script for 0-shot inference with 4 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch -m lm_eval --model hf \
  --model_args pretrained=allenai/tulu-2-dpo-7b,trust_remote_code=True,dtype="bfloat16" \
  --tasks winogrande \
  --num_fewshot 0 \
  --batch_size 1

Our environment:

accelerate=0.27.2
transformers=4.36.2
lm_eval=0.4.0     commit 89618bf8421d27c8cf28004d616b33fc5b305ceb (HEAD -> main, origin/main, origin/HEAD)

Furthermore, we have run the same evaluations on other servers and with the latest version of lm-eval, with similar observations.

Thank you in advance for your assistance!

@baberabb (Contributor) commented:

It's probably because of #1308: the few-shot samples used for a particular doc_id will vary depending on whether DP is used and on the number of ranks. The best way to confirm would be to use a deterministic sampler, as in MMLU:

fewshot_config:
  sampler: first_n
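
To make the mechanism concrete, here is a minimal, self-contained sketch of how this can happen. It is illustrative only (the pool, the documents, and the round-robin sharding are made-up assumptions, not lm-eval's actual sampler code): if the sampler's RNG state advances per document on each rank, then which few-shot examples a given document receives depends on how the documents were sharded, i.e. on the number of ranks. A first_n sampler is unaffected because it never touches the RNG.

```python
import random

fewshot_pool = [f"example_{i}" for i in range(100)]  # hypothetical few-shot pool
docs = list(range(10))                               # hypothetical eval documents

def fewshots_per_doc(num_ranks, k=5, seed=1234):
    """Mimic a sampler whose RNG state advances per document on each rank.

    Which k examples a given doc receives then depends on how the docs were
    sharded across ranks, i.e. on num_ranks. Illustrative only.
    """
    assignment = {}
    for rank in range(num_ranks):
        rng = random.Random(seed)            # every rank starts from the same seed
        for doc in docs[rank::num_ranks]:    # round-robin sharding across ranks
            assignment[doc] = rng.sample(fewshot_pool, k)
    return assignment

one_gpu = fewshots_per_doc(num_ranks=1)
four_gpus = fewshots_per_doc(num_ranks=4)
print(one_gpu[0] == four_gpus[0])  # True: doc 0 is the first draw in both runs
print(one_gpu[3] == four_gpus[3])  # almost certainly False: different RNG state
# A first_n sampler would give every doc fewshot_pool[:k], independent of sharding.
```

Any per-rank RNG like this will shift with the rank count, which would match the table above where every GPU count gives a slightly different accuracy.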

s1ghhh (Author) commented Apr 19, 2024

> The best way to confirm would be to use a deterministic sampler, as in MMLU (fewshot_config: sampler: first_n).

Hi, thank you for the timely help; it was very useful! By selecting the first n samples as few-shot examples, I can now obtain stable results. However, I've noticed that the results from the first_n strategy are lower than those from the previous random-sampling strategy (0.7119 < 0.7443). Perhaps for some tasks, simply selecting the first n items is not reasonable. The solution you mentioned in #1308 seems like a good approach, and I am trying to implement it.
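
For reference, this is the rough direction I am experimenting with, assuming the idea behind #1308 is to make the random draw a pure function of the document rather than of the per-rank RNG state. The names below are a hypothetical illustration, not lm-eval's implementation:

```python
import random

fewshot_pool = [f"example_{i}" for i in range(100)]  # hypothetical few-shot pool

def sample_fewshot(doc_id, k=5, base_seed=1234):
    """Draw k few-shot examples as a deterministic function of (base_seed, doc_id).

    Re-seeding the RNG from the doc_id makes the draw identical regardless of
    how documents are sharded across ranks, while still varying across docs
    instead of always taking the first k items.
    """
    rng = random.Random(f"{base_seed}-{doc_id}")  # per-document seed (string seed)
    return rng.sample(fewshot_pool, k)

# The same doc_id gets the same examples on any number of GPUs / ranks.
assert sample_fewshot(doc_id=7) == sample_fewshot(doc_id=7)
print(sample_fewshot(doc_id=7)[:2])
```

Seeding from the doc_id keeps the sampling random across documents but identical no matter how many ranks the documents are split over.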

@fengzi258 commented:


Hi, have you implemented the approach mentioned in #1308? Can you share it?

s1ghhh (Author) commented Jun 7, 2024


Perhaps you can refer to this
