Add NQ-Open task based on the Natural Questions dataset #789

Merged: 4 commits into EleutherAI:master on Aug 21, 2023

Conversation

@qmdnls commented Aug 17, 2023

I have added the NQ-Open task, which is based on the Natural Questions dataset. This is the version of NQ commonly used to evaluate large language models on open-domain question answering in the closed-book setting. Most prominently, it is the exact dataset used in the evaluation of LLaMA and Llama-2. GPT-3, GPT-4, PaLM, and PaLM-2 also evaluate on this task, but it is not clear whether they use the same split.

Homepage: https://github.com/google-research-datasets/natural-questions/tree/master/nq_open

From the homepage:

The NQ-Open task, introduced by Lee et al. 2019, is an open domain question answering benchmark that is derived from Natural Questions. The goal is to predict an English answer string for an input English question. All questions can be answered using the contents of English Wikipedia.
The NQ-Open task format was also used as part of the EfficientQA competition at NeurIPS 2020. Results from the EfficientQA competition are reported in Min et al. 2021.
The EfficientQA competition used different dev and test splits from the original NQ-open task. This repository contains both the original NQ-open data, as well as the EfficientQA data. Users should take care to ensure they are reporting metrics on the correct splits. All work preceding the EfficientQA competition, in December 2020, reports results on the NQ-open Dev split.

Also related to #9.

I based the implementation on the existing TriviaQA task and followed the common evaluation setting in the papers mentioned above, i.e. case-insensitive exact match after normalizing answers by removing articles and duplicate whitespace.
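
For reference, a minimal sketch of this style of normalization and matching (assumed function names; not necessarily the exact code in the PR) could look like:

```python
import re
import string


def normalize_answer(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop English articles, collapse duplicate whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    # Case-insensitive exact match against any of the gold answers.
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)
```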

I tried to reproduce the Llama-2 evaluation with this implementation, using the task description from the Llama paper ("Answer these questions:"). The scores are as follows:

| NQ-Open | 0-shot | 1-shot | 5-shot |
|---------------------|-------:|-------:|-------:|
| Llama-2-7B (Paper)  | 16.4 | 22.7 | 25.7 |
| Llama-2-7B (ours)   | 19.1 | 23.3 | 26.3 |
| Llama-2-13B (Paper) | 16.1 | 28.0 | 31.2 |
| Llama-2-13B (ours)  | 23.4 | 26.8 | 30.4 |
| Llama-2-70B (Paper) | 25.3 | 33.0 | 39.5 |
| Llama-2-70B (ours)  | 31.9 | 34.5 | 38.6 |

I'm not sure whether this explains the better 0-shot performance, but the main remaining difference is that my implementation uses "Question:" and "Answer:" in the prompt, whereas Llama used "Q:" and "A:".

@haileyschoelkopf (Contributor)

Thank you very much for the contribution!

Would you be willing to test the scores of Llama 2 7B and Llama 1 7B with the "Q:" and "A:" prompt against their reported scores in the paper? Unless you're aware of other notable papers that report using "Question:" and "Answer:" as their prompt format, I think it makes sense to match Llama's evaluation setup, since we know this split is the one they used.

@qmdnls (Author) commented Aug 18, 2023

Good point. From what I can gather, PaLM and GPT-4 also use "Q:" and "A:" in their prompts, so I agree it makes the most sense to use them here as well. I have updated the code.
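
For illustration, with the task description from the Llama paper ("Answer these questions:"), a 1-shot prompt in this format would look roughly like the following; the example question/answer pair here is made up and the exact spacing may differ from the implementation:

```
Answer these questions:

Q: who wrote the origin of species?
A: Charles Darwin

Q: when was the natural questions dataset released?
A:
```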

I have also fixed the normalization: articles at the start of the string weren't being removed properly. The normalization is now the same as that used by e.g. FiD for NQ and in the SQuAD evaluation code.

Here are the updated evaluation results with the adjusted "Q:"/"A:" prompt and the fixed normalization:

| Llama-1 | 0-shot | 1-shot | 5-shot |
|---------------------|-------:|-------:|-------:|
| Llama-1-7B (Paper)  | 16.8 | 18.7 | 22.0 |
| Llama-1-7B (ours)   | 18.3 | 18.8 | 23.9 |
| Llama-1-13B (Paper) | 20.1 | 23.4 | 28.1 |
| Llama-1-13B (ours)  | 23.2 | 25.3 | 29.6 |

| Llama-2 | 0-shot | 1-shot | 5-shot |
|---------------------|-------:|-------:|-------:|
| Llama-2-7B (Paper)  | 16.4 | 22.7 | 25.7 |
| Llama-2-7B (ours)   | 17.5 | 23.0 | 26.1 |
| Llama-2-13B (Paper) | 16.1 | 28.0 | 31.2 |
| Llama-2-13B (ours)  | 23.0 | 27.1 | 30.9 |
| Llama-2-70B (Paper) | 25.3 | 33.0 | 39.5 |
| Llama-2-70B (ours)  | 29.2 | 34.9 | 39.0 |

@haileyschoelkopf (Contributor) commented Aug 18, 2023

Hmm, when I run

python main.py --model hf-causal-experimental --batch_size auto --task nq_open --model_args pretrained=huggyllama/llama-7b

I get only 5%:

hf-causal-experimental (pretrained=huggyllama/llama-7b), limit: None, provide_description: False, num_fewshot: 0, batch_size: auto
| Task  |Version|Metric|Value |   |Stderr|
|-------|------:|------|-----:|---|-----:|
|nq_open|      0|em    |0.0507|±  |0.0037|

Do you know why this might not match the score you're reporting? I'm on the most recent commit of your PR branch.

EDIT: I do get 19.47% on 1-shot though:

| Task  |Version|Metric|Value |   |Stderr|
|-------|------:|------|-----:|---|-----:|
|nq_open|      0|em    |0.1947|±  |0.0066|

These are with batch size = 8.

@haileyschoelkopf linked an issue Aug 18, 2023 that may be closed by this pull request
@qmdnls (Author) commented Aug 21, 2023

For direct comparison with the results from the Llama paper, I tried to reproduce their approach and prompt as closely as possible and used their task description in the prompt, i.e. "Answer these questions:" (see Appendix A of the paper, https://arxiv.org/pdf/2302.13971.pdf). Whether or not this kind of task description is used makes a much bigger difference in the 0-shot case; without it I saw scores similar to what you got.

Looking at your command, it seems you may have run the evaluation without the task description, which I think would explain the large difference. Could you re-run the eval with the task description and see if the scores match my numbers?

I simply used --description_dict descriptions.json and my descriptions.json just looks like this:

{
        "nq_open": "Answer these questions:"
}
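
So, reusing your command from above and adding that flag, the 0-shot invocation would be something like this (a sketch, assuming descriptions.json sits in the working directory):

```
python main.py --model hf-causal-experimental --batch_size auto --task nq_open --model_args pretrained=huggyllama/llama-7b --description_dict descriptions.json
```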

@haileyschoelkopf (Contributor) commented Aug 21, 2023

Thanks, you're correct: I missed that you were passing a description_dict! I've confirmed that my results now match yours. In that case, I think this is good to merge. Thanks very much for the contribution; it's great to finally have NQ in the repo!

@haileyschoelkopf merged commit 341f04c into EleutherAI:master on Aug 21, 2023
2 checks passed