Add NQ-Open task based on the Natural Questions dataset #789

Merged: 4 commits into EleutherAI:master on Aug 21, 2023

Conversation

@qmdnls commented Aug 17, 2023

I have added the NQ-Open task, which is based on the Natural Questions dataset. This is the version of NQ commonly used to evaluate large language models on open-domain question answering in the closed-book setting. Most prominently, it is the exact dataset used in the evaluation of LLaMA and Llama-2. GPT-3, GPT-4, PaLM, and PaLM-2 also evaluate on this task, but it is not clear whether they use the same split.

Homepage: https://github.com/google-research-datasets/natural-questions/tree/master/nq_open

From the homepage:

The NQ-Open task, introduced by Lee et al. 2019, is an open domain question answering benchmark that is derived from Natural Questions. The goal is to predict an English answer string for an input English question. All questions can be answered using the contents of English Wikipedia.
The NQ-Open task format was also used as part of the EfficientQA competition at NeurIPS 2020. Results from the EfficientQA competition are reported in Min et al. 2021.
The EfficientQA competition used different dev and test splits from the original NQ-open task. This repository contains both the original NQ-open data, as well as the EfficientQA data. Users should take care to ensure they are reporting metrics on the correct splits. All work preceding the EfficientQA competition, in December 2020, reports results on the NQ-open Dev split.

Also related to #9.

I based the implementation on the existing TriviaQA task and followed the common evaluation setting in the papers mentioned above, i.e. case-insensitive exact match after normalizing answers by removing articles and duplicate whitespace.
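
For reference, a minimal sketch of this style of normalization and matching (assumed function names; not necessarily the exact code in the PR) could look like:

```python
import re
import string


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop English articles, collapse duplicate whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # Case-insensitive exact match against any of the gold answers.
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)
```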

I tried to reproduce the Llama-2 evaluation with this implementation, using the task description from the Llama paper ("Answer these questions:"). The scores are as follows:

| NQ-Open | 0-shot | 1-shot | 5-shot |
|---------------------|-------:|-------:|-------:|
| Llama-2-7B (Paper)  | 16.4 | 22.7 | 25.7 |
| Llama-2-7B (ours)   | 19.1 | 23.3 | 26.3 |
| Llama-2-13B (Paper) | 16.1 | 28.0 | 31.2 |
| Llama-2-13B (ours)  | 23.4 | 26.8 | 30.4 |
| Llama-2-70B (Paper) | 25.3 | 33.0 | 39.5 |
| Llama-2-70B (ours)  | 31.9 | 34.5 | 38.6 |

I'm not sure whether this explains the better 0-shot performance, but the main remaining difference is that my implementation uses "Question:" and "Answer:" in the prompt, whereas Llama used "Q:" and "A:".

@haileyschoelkopf (Contributor)

Thank you very much for the contribution!

Would you be willing to test the scores of Llama 2 7B and Llama 1 7B with the "Q:" and "A:" prompt against their reported scores in the paper? Unless you're aware of other notable papers that report using "Question:" and "Answer:" as their prompt format, I think it makes sense to match Llama's evaluation setup, since we know this split is the one they used.

@qmdnls (Author) commented Aug 18, 2023

Good point. From what I can gather, PaLM and GPT-4 also use "Q:" and "A:" in their prompts, so I agree it makes the most sense to use them here as well. I have updated the code.
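
For illustration, with the task description from the Llama paper ("Answer these questions:"), a 1-shot prompt in this format would look roughly like the following; the example question/answer pair here is made up and the exact spacing may differ from the implementation:

```
Answer these questions:

Q: who wrote the origin of species?
A: Charles Darwin

Q: when was the natural questions dataset released?
A:
```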

I have also fixed the normalization: articles at the start of the string weren't being removed properly. The normalization is now the same as that used by e.g. FiD for NQ and in the SQuAD evaluation code.

Here are the updated evaluation results with the adjusted "Q:"/"A:" prompt and the fixed normalization:

| Llama-1 | 0-shot | 1-shot | 5-shot |
|---------------------|-------:|-------:|-------:|
| Llama-1-7B (Paper)  | 16.8 | 18.7 | 22.0 |
| Llama-1-7B (ours)   | 18.3 | 18.8 | 23.9 |
| Llama-1-13B (Paper) | 20.1 | 23.4 | 28.1 |
| Llama-1-13B (ours)  | 23.2 | 25.3 | 29.6 |

| Llama-2 | 0-shot | 1-shot | 5-shot |
|---------------------|-------:|-------:|-------:|
| Llama-2-7B (Paper)  | 16.4 | 22.7 | 25.7 |
| Llama-2-7B (ours)   | 17.5 | 23.0 | 26.1 |
| Llama-2-13B (Paper) | 16.1 | 28.0 | 31.2 |
| Llama-2-13B (ours)  | 23.0 | 27.1 | 30.9 |
| Llama-2-70B (Paper) | 25.3 | 33.0 | 39.5 |
| Llama-2-70B (ours)  | 29.2 | 34.9 | 39.0 |

@haileyschoelkopf (Contributor) commented Aug 18, 2023

Hmm, when I run

python main.py --model hf-causal-experimental --batch_size auto --task nq_open --model_args pretrained=huggyllama/llama-7b

I get only 5%:

hf-causal-experimental (pretrained=huggyllama/llama-7b), limit: None, provide_description: False, num_fewshot: 0, batch_size: auto
| Task  |Version|Metric|Value |   |Stderr|
|-------|------:|------|-----:|---|-----:|
|nq_open|      0|em    |0.0507|±  |0.0037|

Do you know why this might not match the score you're reporting? I'm on the most recent commit of your PR branch.

EDIT: I do get 19.47% on 1-shot though:

| Task  |Version|Metric|Value |   |Stderr|
|-------|------:|------|-----:|---|-----:|
|nq_open|      0|em    |0.1947|±  |0.0066|

These are with batch size = 8.

@haileyschoelkopf linked an issue Aug 18, 2023 that may be closed by this pull request
@qmdnls (Author) commented Aug 21, 2023

For direct comparison with the results from the Llama paper, I tried to reproduce their approach and prompt as closely as possible and used their task description in the prompt, i.e. "Answer these questions:" (see Appendix A of the paper, https://arxiv.org/pdf/2302.13971.pdf). Whether or not this kind of task description is used makes a much bigger difference in the 0-shot case; without it I saw scores similar to what you got.

Looking at your command, it seems you may have run the evaluation without the task description, which I think would explain the large difference. Could you re-run the eval with the task description and see if the scores match my numbers?

I simply used --description_dict descriptions.json and my descriptions.json just looks like this:

{
        "nq_open": "Answer these questions:"
}
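
So, reusing your command from above and adding that flag, the 0-shot invocation would be something like this (a sketch, assuming descriptions.json sits in the working directory):

```
python main.py --model hf-causal-experimental --batch_size auto --task nq_open --model_args pretrained=huggyllama/llama-7b --description_dict descriptions.json
```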

@haileyschoelkopf (Contributor) commented Aug 21, 2023

Thanks, you're correct: I missed that you were passing a description_dict! I've confirmed that my results now match yours. In that case, I think this is good to merge. Thanks very much for the contribution; it's great to finally have NQ in the repo!

@haileyschoelkopf merged commit 341f04c into EleutherAI:master on Aug 21, 2023
2 checks passed