Evaluation MC Questions #1875

Open

kangqi-ni opened this issue May 23, 2024 · 2 comments
kangqi-ni commented May 23, 2024

I have a question regarding evaluating LLMs on multiple-choice (MC) questions using the log-likelihood of tokens. Based on existing implementations (e.g. for MMLU), the code would look something like this:

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Create the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(...)
tokenizer = AutoTokenizer.from_pretrained(...)

# Tokenize the inputs
inputs = tokenizer(...)

# Get the logits at the last position
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
    )
logits = outputs.logits[0, -1, :]  # shape: (vocab_size,)

# Compute probabilities over the answer-letter tokens
probs = (
    torch.nn.functional.softmax(
        torch.stack(
            [
                logits[tokenizer(" A").input_ids[-1]],  # or sometimes tokenizing "A" instead of " A" etc.
                logits[tokenizer(" B").input_ids[-1]],
                logits[tokenizer(" C").input_ids[-1]],
                logits[tokenizer(" D").input_ids[-1]],
            ]
        ).float(),
        dim=0,
    )
    .detach()
    .cpu()
    .numpy()
)
prediction = {0: "A", 1: "B", 2: "C", 3: "D"}[int(np.argmax(probs))]

However, the result from doing this is usually quite different from what lm-evaluation-harness reports. For example, evaluating Llama both ways gives two different numbers. I looked through the code in this repo but could not pin down the difference. I understand that this repo has more wrappers around it, but under the hood, is the idea the same as in the snippet above? I am confused about why coding it this way gives me such different results from lm-eval-harness.
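For context, my rough understanding of what the harness does is: score each answer choice by the log-likelihood of the choice text as a continuation of the prompt (summed over the continuation's tokens) and take the argmax, which for single-token choices like " A" should reduce to the snippet above. Below is a minimal sketch of that idea, reusing model, tokenizer, and the formatted prompt from above; the function name is mine, not from the harness, and this ignores details such as batching and tokenization at the context/continuation boundary:

import torch
import torch.nn.functional as F

def continuation_logprob(model, tokenizer, context, continuation):
    # Sum of log-probabilities of the continuation tokens given the context.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab_size)
    log_probs = F.log_softmax(logits.float(), dim=-1)
    total = 0.0
    # Each continuation token is predicted by the logits at the preceding position.
    for pos in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = ...  # the formatted question + choices + "Answer:" string
choices = [" A", " B", " C", " D"]
scores = [continuation_logprob(model, tokenizer, prompt, c) for c in choices]
prediction = choices[max(range(len(choices)), key=lambda i: scores[i])].strip()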

@haileyschoelkopf
Contributor

Yes, the result should be the same (at least for the single-token completions case). Have you compared the logits you get to the per-sample values we log? (Note that the logged values won't have the softmax applied to them; we just select the largest one.)

@haileyschoelkopf
Contributor

Something to flag: occasionally model predictions can flip due to things as minor as running with a different batch size, if the logits for two answer choices are very close or are the same floating-point value. But there should not be large divergences.
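For illustration, two logits that are distinct in float32 can collapse to the same float16 value, at which point the argmax is decided by tie-breaking rather than by the model (the numbers below are made up just to show the rounding effect):

import torch

# Two answer-choice logits that differ only far past float16 precision.
logits_fp32 = torch.tensor([10.0001, 10.0002])
print(torch.argmax(logits_fp32).item())  # 1 -- the second choice wins in float32

# In float16 both values round to the same representable number (10.0),
# so the "winner" is decided by tie-breaking, which can vary with kernels/batching.
logits_fp16 = logits_fp32.to(torch.float16)
print(logits_fp16)                       # tensor([10., 10.], dtype=torch.float16)
print(torch.argmax(logits_fp16).item())  # index of a tie, not a real model preference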
