For some models and prompts, the loglikelihood changes with the batch size. #704

Closed
yeoedward opened this issue Jul 25, 2023 · 2 comments

@yeoedward
Contributor

I think this problem lies in the underlying transformers library, but I'm creating an issue here to document the behavior, as it results in inconsistent evaluation scores. It was encountered in #695.

To reproduce:


model: pretrained=EleutherAI/pythia-160m
context: The SWAT team moved in on the compound to prevent the terrorists from launching a deadly missile because the terrorists
continuation: were trying to terrorize the global population.

import lm_eval
import lm_eval.api.registry
import lm_eval.models.huggingface
from lm_eval.api.instance import Instance

lm = lm_eval.api.registry.get_model('hf').create_from_arg_string(
    'pretrained=EleutherAI/pythia-160m',
    {
        "batch_size": 32,
        "max_batch_size": None,
        "device": "cuda",
    },
)
req = Instance(
    request_type='loglikelihood',
    arguments=(
        "The SWAT team moved in on the compound to prevent the terrorists from launching a deadly missile because the terrorists",
        " were trying to terrorize the global population.",
    ),
    doc=0,
    idx=0,
    repeats=1,
)

A batch of four requests

lm.loglikelihood([req, req, req, req])

Returns

[(-21.84375, False),
 (-21.84375, False),
 (-21.84375, False),
 (-21.84375, False)]

While a batch of five requests

lm.loglikelihood([req, req, req, req, req])

Returns

[(-21.765625, False),
 (-21.765625, False),
 (-21.765625, False),
 (-21.765625, False),
 (-21.765625, False)]

The same issue is also present for

model: pretrained=facebook/opt-125m
context: Bush beat Gore because Gore
continuation: was unpopular.


The problem also manifests when running the following command with different tasks:

python main.py --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks $TASKS --batch_size 32

When
TASKS=xwinograd_en,xwinograd_fr,xwinograd_jp,xwinograd_pt,xwinograd_ru,xwinograd_zh

hf (pretrained=EleutherAI/pythia-160m), limit: None, num_fewshot: 0, batch_size: 32
|    Task    |Version|Filter|Metric|Value |   |Stderr|
|------------|-------|------|------|-----:|---|-----:|
|xwinograd_en|Yaml   |none  |acc   |0.6297|±  |0.0100|
|xwinograd_fr|Yaml   |none  |acc   |0.5181|±  |0.0552|
|xwinograd_jp|Yaml   |none  |acc   |0.4964|±  |0.0162|
|xwinograd_pt|Yaml   |none  |acc   |0.5171|±  |0.0309|
|xwinograd_ru|Yaml   |none  |acc   |0.5365|±  |0.0281|
|xwinograd_zh|Yaml   |none  |acc   |0.5079|±  |0.0223|

When
TASKS=xwinograd_en

hf (pretrained=EleutherAI/pythia-160m), limit: None, num_fewshot: 0, batch_size: 32
|    Task    |Version|Filter|Metric|Value |   |Stderr|
|------------|-------|------|------|-----:|---|-----:|
|xwinograd_en|Yaml   |none  |acc   |0.6305|±  |  0.01|

The xwinograd_en task gets a different score in the two runs (0.6297 vs. 0.6305), presumably because of incidental differences in how the requests end up batched.
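To make that concrete, here is a toy illustration (my own sketch, not the harness's actual request-collation logic, which also sorts requests by length) of how evaluating extra tasks alongside xwinograd_en changes how full the final batch is and which requests share a batch:

# Toy sketch only: the point is that batch composition depends on the full request list.
def batches(requests, batch_size=32):
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

en_only = [f"en_{i}" for i in range(100)]
combined = [f"fr_{i}" for i in range(50)] + en_only

print(len(batches(en_only)[-1]))   # last batch holds 4 requests
print(len(batches(combined)[-1]))  # last batch holds 22 requests

Given the batch-of-4 vs. batch-of-5 demonstration above, differently sized batches can yield slightly different loglikelihoods, which is enough to flip a few comparisons and move the accuracy.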

@haileyschoelkopf
Collaborator

Thanks so much for a really thorough writeup of this! It's really appreciated.

I'll see if I can find a more self-contained explanation or reference for why this is the case, but I think it is unfortunately expected on GPU: certain sums get executed in different orders depending on the batch shape, and they accumulate small errors because of the non-associativity of floating-point ops.
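As a minimal, hardware-independent sketch of that effect (the GPU case is the same phenomenon, just arising from the kernel's reduction order rather than explicit parenthesization):

# Floating-point addition is not associative, so the order in which
# partial sums are accumulated changes the result slightly.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0

Changing the batch size changes how the matmul and softmax reductions are tiled, so the logits (and hence the summed loglikelihoods) can shift by a few ULPs.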

This is something we might be able to improve (though likely not fully fix) by adding a torch.use_deterministic_algorithms(True) call to our eval loop, but that could have unintended consequences for people who run the library as an eval step in the middle of their training code. I therefore think it's worth leaving as is (and maybe documenting in the README, along with encouraging people to pay more attention to the stderr we report), but feel free to reopen if you'd like.
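For reference, a sketch of the standard PyTorch reproducibility setup that such a change would roughly amount to (not something the harness does today; ops without deterministic implementations will raise, and others get slower):

import os
import torch

# Some CUDA ops (e.g. cuBLAS GEMMs) need this workspace setting, applied before
# the first CUDA call, to behave deterministically.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Request deterministic kernels; ops lacking one raise instead of silently varying.
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False

Note that even this only makes results reproducible for a fixed batch size and hardware; it does not make different batch sizes agree with each other.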
