Add NQ-Open task based on the Natural Questions dataset #789
I have added the NQ-Open task, which is based on the Natural Questions dataset. This is the version of NQ commonly used to evaluate large language models on open-domain question answering in the closed-book setting. Most prominently, it is the exact dataset used in the evaluations of LLaMA and Llama-2. GPT-3, GPT-4, PaLM, and PaLM-2 also evaluate on this task, but it is not clear whether they use the same split.
Homepage: https://github.com/google-research-datasets/natural-questions/tree/master/nq_open
From the homepage:
Also related to #9.
I based the implementation on the existing TriviaQA task and followed the common evaluation setting from the papers mentioned above, i.e. case-insensitive exact match after normalizing answers by removing articles and collapsing duplicate whitespace; a sketch of this metric is shown below.
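For reviewers, here is a minimal sketch of the normalization and matching logic described above. It is illustrative only, not the code in this PR, and the function names are hypothetical; it covers just the three steps named here (lowercasing, article removal, whitespace collapsing).

```python
import re

def normalize_answer(text: str) -> str:
    """Lowercase, strip the articles a/an/the, and collapse duplicate whitespace."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """Case-insensitive exact match against any of the gold answers."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)

# Example: both sides normalize to "eiffel tower", so this is a match.
assert exact_match("The Eiffel Tower", ["Eiffel Tower"])
```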
I tried to reproduce the Llama-2 evaluation with this implementation, using the task description from the Llama paper ("Answer these questions:"). Scores are as follows:
I am not sure whether this explains the better 0-shot performance, but the main remaining difference is that my implementation uses "Question:" and "Answer:" in the prompt, whereas Llama used "Q:" and "A:".
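To make that prompt difference concrete, here is a sketch of what a 1-shot prompt might look like under each format. Only the "Answer these questions:" description and the field labels come from the discussion above; the example questions, answers, and exact spacing between examples are made-up assumptions.

```python
# Prompt format in this implementation (example content is hypothetical).
this_pr = (
    "Answer these questions:\n\n"
    "Question: who wrote the declaration of independence\n"
    "Answer: Thomas Jefferson\n\n"
    "Question: where is the eiffel tower located\n"
    "Answer:"
)

# Llama-style prompt: same content, shorter field labels.
llama_style = (
    "Answer these questions:\n\n"
    "Q: who wrote the declaration of independence\n"
    "A: Thomas Jefferson\n\n"
    "Q: where is the eiffel tower located\n"
    "A:"
)
```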