Evaluation MC Questions #1875

Open

kangqi-ni opened this issue May 23, 2024 · 2 comments
kangqi-ni commented May 23, 2024

I have a question regarding evaluating LLMs on multiple-choice (MC) questions using the log-likelihood of tokens. Based on existing implementations (e.g. for MMLU), the code would look something like this:

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Create the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(...)
tokenizer = AutoTokenizer.from_pretrained(...)

# Tokenize the inputs
inputs = tokenizer(...)

# Get the logits at the last position
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
    )
logits = outputs.logits[0, -1, :]  # shape: (vocab_size,)

# Compute probabilities over the answer-letter tokens
probs = (
    torch.nn.functional.softmax(
        torch.stack(
            [
                logits[tokenizer(" A").input_ids[-1]],  # or sometimes tokenizing "A" instead of " A" etc.
                logits[tokenizer(" B").input_ids[-1]],
                logits[tokenizer(" C").input_ids[-1]],
                logits[tokenizer(" D").input_ids[-1]],
            ]
        ).float(),
        dim=0,
    )
    .detach()
    .cpu()
    .numpy()
)
prediction = {0: "A", 1: "B", 2: "C", 3: "D"}[int(np.argmax(probs))]

However, the result from doing this is usually quite different from what lm-evaluation-harness reports. For example, evaluating Llama both ways gives two different numbers. I looked through the code in this repo but could not pin down the difference. I understand that this repo has more wrappers around it, but under the hood, is the idea the same as in the snippet above? I am confused about why coding it this way gives me such different results from lm-eval-harness.
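For context, my rough understanding of what the harness does is: score each answer choice by the log-likelihood of the choice text as a continuation of the prompt (summed over the continuation's tokens) and take the argmax, which for single-token choices like " A" should reduce to the snippet above. Below is a minimal sketch of that idea, reusing model, tokenizer, and the formatted prompt from above; the function name is mine, not from the harness, and this ignores details such as batching and tokenization at the context/continuation boundary:

import torch
import torch.nn.functional as F

def continuation_logprob(model, tokenizer, context, continuation):
    # Sum of log-probabilities of the continuation tokens given the context.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab_size)
    log_probs = F.log_softmax(logits.float(), dim=-1)
    total = 0.0
    # Each continuation token is predicted by the logits at the preceding position.
    for pos in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = ...  # the formatted question + choices + "Answer:" string
choices = [" A", " B", " C", " D"]
scores = [continuation_logprob(model, tokenizer, prompt, c) for c in choices]
prediction = choices[max(range(len(choices)), key=lambda i: scores[i])].strip()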

@haileyschoelkopf
Contributor

Yes, the result should be the same (at least for the single-token completions case). Have you compared the logits you get to the per-sample values we log? (Note that the logged values won't have the softmax applied to them; we just select the largest one.)

@haileyschoelkopf
Contributor

Something to flag: occasionally model predictions can flip due to things as minor as running with a different batch size, if the logits for two answer choices are very close or are the same floating-point value. But there should not be large divergences.
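For illustration, two logits that are distinct in float32 can collapse to the same float16 value, at which point the argmax is decided by tie-breaking rather than by the model (the numbers below are made up just to show the rounding effect):

import torch

# Two answer-choice logits that differ only far past float16 precision.
logits_fp32 = torch.tensor([10.0001, 10.0002])
print(torch.argmax(logits_fp32).item())  # 1 -- the second choice wins in float32

# In float16 both values round to the same representable number (10.0),
# so the "winner" is decided by tie-breaking, which can vary with kernels/batching.
logits_fp16 = logits_fp32.to(torch.float16)
print(logits_fp16)                       # tensor([10., 10.], dtype=torch.float16)
print(torch.argmax(logits_fp16).item())  # index of a tie, not a real model preference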
