For some models and prompts, the loglikelihood changes with the batch size. #704

Closed
yeoedward opened this issue Jul 25, 2023 · 2 comments

@yeoedward
Contributor

I think this problem lies in the underlying transformers library, but I'm creating an issue here to document the behavior, as it results in inconsistent evaluation scores. It was encountered in #695.

To reproduce:


model: pretrained=EleutherAI/pythia-160m
context: The SWAT team moved in on the compound to prevent the terrorists from launching a deadly missile because the terrorists
continuation: were trying to terrorize the global population.

import lm_eval
import lm_eval.api.registry
import lm_eval.models.huggingface
from lm_eval.api.instance import Instance

lm = lm_eval.api.registry.get_model('hf').create_from_arg_string(
    'pretrained=EleutherAI/pythia-160m',
    {
        "batch_size": 32,
        "max_batch_size": None,
        "device": "cuda",
    },
)
req = Instance(
    request_type='loglikelihood',
    arguments=(
        "The SWAT team moved in on the compound to prevent the terrorists from launching a deadly missile because the terrorists",
        " were trying to terrorize the global population.",
    ),
    doc=0,
    idx=0,
    repeats=1,
)

A batch of four requests

lm.loglikelihood([req, req, req, req])

Returns

[(-21.84375, False),
 (-21.84375, False),
 (-21.84375, False),
 (-21.84375, False)]

While a batch of five requests

lm.loglikelihood([req, req, req, req, req])

Returns

[(-21.765625, False),
 (-21.765625, False),
 (-21.765625, False),
 (-21.765625, False),
 (-21.765625, False)]

The same issue is also present for

model: pretrained=facebook/opt-125m
context: Bush beat Gore because Gore
continuation: was unpopular.


The problem also manifests when running the following command with different tasks:

python main.py --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks $TASKS --batch_size 32

When
TASKS=xwinograd_en,xwinograd_fr,xwinograd_jp,xwinograd_pt,xwinograd_ru,xwinograd_zh

hf (pretrained=EleutherAI/pythia-160m), limit: None, num_fewshot: 0, batch_size: 32
|    Task    |Version|Filter|Metric|Value |   |Stderr|
|------------|-------|------|------|-----:|---|-----:|
|xwinograd_en|Yaml   |none  |acc   |0.6297|±  |0.0100|
|xwinograd_fr|Yaml   |none  |acc   |0.5181|±  |0.0552|
|xwinograd_jp|Yaml   |none  |acc   |0.4964|±  |0.0162|
|xwinograd_pt|Yaml   |none  |acc   |0.5171|±  |0.0309|
|xwinograd_ru|Yaml   |none  |acc   |0.5365|±  |0.0281|
|xwinograd_zh|Yaml   |none  |acc   |0.5079|±  |0.0223|

When
TASKS=xwinograd_en

hf (pretrained=EleutherAI/pythia-160m), limit: None, num_fewshot: 0, batch_size: 32
|    Task    |Version|Filter|Metric|Value |   |Stderr|
|------------|-------|------|------|-----:|---|-----:|
|xwinograd_en|Yaml   |none  |acc   |0.6305|±  |  0.01|

The xwinograd_en task gets a different score in the two runs (0.6297 vs. 0.6305), presumably because of incidental differences in how the requests end up batched.
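To make that concrete, here is a toy illustration (my own sketch, not the harness's actual request-collation logic, which also sorts requests by length) of how evaluating extra tasks alongside xwinograd_en changes how full the final batch is and which requests share a batch:

# Toy sketch only: the point is that batch composition depends on the full request list.
def batches(requests, batch_size=32):
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

en_only = [f"en_{i}" for i in range(100)]
combined = [f"fr_{i}" for i in range(50)] + en_only

print(len(batches(en_only)[-1]))   # last batch holds 4 requests
print(len(batches(combined)[-1]))  # last batch holds 22 requests

Given the batch-of-4 vs. batch-of-5 demonstration above, differently sized batches can yield slightly different loglikelihoods, which is enough to flip a few comparisons and move the accuracy.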

@haileyschoelkopf
Collaborator

Thanks so much for a really thorough writeup of this! It's really appreciated.

I'll see if I can find a more self-contained explanation or reference for why this is the case, but I think it is unfortunately expected on GPU: certain sums get executed in different orders depending on the batch shape, and they accumulate small errors because of the non-associativity of floating-point ops.
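As a minimal, hardware-independent sketch of that effect (the GPU case is the same phenomenon, just arising from the kernel's reduction order rather than explicit parenthesization):

# Floating-point addition is not associative, so the order in which
# partial sums are accumulated changes the result slightly.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0

Changing the batch size changes how the matmul and softmax reductions are tiled, so the logits (and hence the summed loglikelihoods) can shift by a few ULPs.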

This is something we might be able to improve (though likely not fully fix) by adding a torch.use_deterministic_algorithms(True) call to our eval loop, but that could have unintended consequences for people who run the library as an eval step in the middle of their training code. I therefore think it's worth leaving as is (and maybe documenting in the README, along with encouraging people to pay more attention to the stderr we report), but feel free to reopen if you'd like.
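For reference, a sketch of the standard PyTorch reproducibility setup that such a change would roughly amount to (not something the harness does today; ops without deterministic implementations will raise, and others get slower):

import os
import torch

# Some CUDA ops (e.g. cuBLAS GEMMs) need this workspace setting, applied before
# the first CUDA call, to behave deterministically.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Request deterministic kernels; ops lacking one raise instead of silently varying.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

Note that even this only makes results reproducible for a fixed batch size and hardware; it does not make different batch sizes agree with each other.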
