
Output logits differ for the same input text in a batch of size 1 with half precision on GPU #28732

Closed
zhukpm opened this issue Jan 26, 2024 · 3 comments

zhukpm commented Jan 26, 2024

System Info

Linux 20.04.1-Ubuntu x86_64 GNU/Linux
Python 3.10.12
transformers==4.37.1
torch==2.1.2+cu121

GPU A100
NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

We run inference with a CausalLM model, providing the same input text in two different batches: one batch of size 1 and one of size > 1. The output logits for the same input sequence differ slightly between the two runs.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed


# MODEL_ID = 'mistralai/Mistral-7B-Instruct-v0.2'
MODEL_ID = 'facebook/opt-350m'


model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    return_dict=True
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

batches = [
    ['hello, world'],
    ['hello, world', 'hello', 'world']
]

tokenized = [tokenizer(b, padding='longest', return_tensors='pt').to(model.device) for b in batches]

assert (tokenized[0]['input_ids'][0] == tokenized[1]['input_ids'][0]).all().item()

set_seed(0)
with torch.inference_mode():
    logits = [model(**t).logits for t in tokenized]

# This assertion fails with float16/bfloat16 on GPU (see "Expected behavior" below)
assert torch.allclose(logits[0][0], logits[1][0], atol=1e-3)
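
To see how large the discrepancy actually is, here is a small follow-up sketch (run after the script above) that prints the magnitude of the mismatch instead of only asserting closeness:

# Follow-up sketch: quantify the mismatch between the two runs
diff = (logits[0][0] - logits[1][0]).abs()
print(f'max abs diff:  {diff.max().item():.6f}')
print(f'mean abs diff: {diff.mean().item():.6f}')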

Expected behavior

Output logits should be the same (or at least very close to each other) regardless of the batch size.
Note that we observe this problem only with torch.float16 and torch.bfloat16 on GPUs.

The code above works without errors

  • on CPUs
  • when using float32
  • when comparing batches of sizes e.g. 2 and 3:
batches = [
    ['hello, world', 'hello'],
    ['hello, world', 'hello', 'world']
]

So, for some reason, the problem occurs only with half precision and batch_size=1.

I think that this thread might be related somehow, but I'm not sure.
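
A minimal probe that tries to isolate this at the matmul level (this assumes the discrepancy comes from batch-size-dependent GPU kernel selection / reduction order at half precision, which is only a guess on my side):

import torch

torch.manual_seed(0)
x = torch.randn(3, 16, 1024, dtype=torch.bfloat16, device='cuda')
w = torch.randn(1024, 1024, dtype=torch.bfloat16, device='cuda')

out_single = x[:1] @ w      # first sequence alone, batch of size 1
out_batched = (x @ w)[:1]   # same sequence taken from a batch of size 3

# Whether these match bit-for-bit depends on the GPU and the kernels cuBLAS selects
print((out_single - out_batched).abs().max().item())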

@ArthurZucker (Collaborator)

This seems like a duplicate of #25420 (comment)


zhukpm commented Jan 29, 2024

Yeah, it is. Thanks
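
For anyone hitting the same thing: the strict atol=1e-3 in the repro is tighter than what half precision generally guarantees, so a looser, dtype-aware comparison could look like the sketch below (the exact tolerances that pass will depend on the model, GPU and kernels):

# Sketch: compare with tolerances better suited to bf16/fp16 than the strict atol=1e-3 above
close = torch.allclose(logits[0][0], logits[1][0], rtol=1e-2, atol=1e-2)
print('logits match within half-precision-level tolerance:', close)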


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
