Implement the Natural Questions evaluation #9

Closed
1 of 2 tasks
StellaAthena opened this issue Sep 16, 2020 · 19 comments · Fixed by #789
Labels
feature request A feature that isn't implemented yet. good first issue Good for newcomers

Comments

@StellaAthena
Member

StellaAthena commented Sep 16, 2020

From the GPT-3 paper

In this section we measure GPT-3’s ability to answer questions about broad factual knowledge. Due to the immense amount of possible queries, this task has normally been approached by using an information retrieval system to find relevant text in combination with a model which learns to generate an answer given the question and the retrieved text. Since this setting allows a system to search for and condition on text which potentially contains the answer it is denoted “open-book”. [RRS20] recently demonstrated that a large language model can perform surprisingly well directly answering the questions without conditioning on auxiliary information. They denote this more restrictive evaluation setting as “closed-book”. Their work suggests that even higher-capacity models could perform even better and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR+19], WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represent an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.

  • Data processing code implemented
  • Evaluation implemented

The evaluation code should be modeled after the interface in lm_eval/base.py and the example of the BoolQ task in lm_eval/tasks/superglue.py.
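
For illustration, a minimal sketch of what the evaluation half might look like, assuming closed-book exact-match scoring with SQuAD-style answer normalization. The names `normalize_answer`, `exact_match`, and `NaturalQuestionsSketch` are hypothetical and are not taken from lm_eval/base.py; the prompt format and the `question`/`answers` record fields are assumptions as well.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


class NaturalQuestionsSketch:
    """Hypothetical task wrapper; NOT the interface defined in lm_eval/base.py."""

    def doc_to_text(self, doc):
        # Assumed closed-book prompt format: the question only, no retrieved text.
        return f"Q: {doc['question']}\nA:"

    def score(self, doc, generation):
        # Keep only the first line of the generation so that any follow-on
        # text the model produces is not counted as part of the answer.
        stripped = generation.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        return exact_match(first_line, doc["answers"])


task = NaturalQuestionsSketch()
doc = {"question": "who wrote the origin of species", "answers": ["Charles Darwin"]}
print(task.doc_to_text(doc))
print(task.score(doc, " charles darwin.\nQ: next question"))
```

Averaging `score` over the evaluation split would give an exact-match accuracy; whether that matches the metric reported in the GPT-3 paper is not confirmed here.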

@StellaAthena StellaAthena added the feature request A feature that isn't implemented yet. label Sep 16, 2020
@StellaAthena StellaAthena added this to To do in Implementing Evaluations via automation Sep 16, 2020
@cfoster0
Contributor

cfoster0 commented Oct 1, 2020

Note: HuggingFace includes this in its datasets package.

https://huggingface.co/datasets/natural_questions
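
A rough sketch of peeking at the data through that package. It assumes the dataset is still published under the identifier `natural_questions` with a `validation` split and that streaming mode works for it; `take` and `stream_nq_validation` are hypothetical helper names.

```python
from itertools import islice


def take(records, n):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(records, n))


def stream_nq_validation():
    """Open the validation split lazily instead of downloading it up front."""
    # Imported here so that `take` works without the package installed.
    from datasets import load_dataset

    # Assumption: dataset id and split name as on the page linked above.
    return load_dataset("natural_questions", split="validation", streaming=True)


# Usage (needs network access and the `datasets` package):
#     for record in take(stream_nq_validation(), 2):
#         print(sorted(record.keys()))
```

Streaming reads records on demand, which can help when only a handful of examples are needed for inspection.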

@cfoster0
Contributor

cfoster0 commented Oct 5, 2020

Warning: This dataset is super big.

@StellaAthena
Member Author

> Warning: This dataset is super big.

How big is “super big”?

@cfoster0
Contributor

cfoster0 commented Oct 5, 2020

97G.

@sdtblck
Contributor

sdtblck commented Oct 5, 2020

what the fuck. Why are we not training on this.

@sdtblck
Contributor

sdtblck commented Oct 5, 2020

Ah, dev set is only 1G. But we should add train set to the pile.

@cfoster0
Contributor

cfoster0 commented Oct 7, 2020

We would need to dedupe this with Wikipedia, since the bulk of it is just the HTML of Wikipedia pages.

@StellaAthena StellaAthena added Eval Set and removed feature request A feature that isn't implemented yet. labels Oct 23, 2020
@StellaAthena StellaAthena pinned this issue Oct 23, 2020
@anishthite anishthite moved this from To do to In progress in Implementing Evaluations Oct 24, 2020
@anishthite anishthite moved this from In progress to Data integrated, Eval not done in Implementing Evaluations Oct 24, 2020
@StellaAthena StellaAthena unpinned this issue Nov 30, 2020
@StellaAthena StellaAthena reopened this Jan 5, 2021
@StellaAthena StellaAthena added feature request A feature that isn't implemented yet. good first issue Good for newcomers labels Jan 5, 2021
@leogao2 leogao2 moved this from In Progress to To do in Implementing Evaluations Jan 28, 2021
@moirage
Collaborator

moirage commented Jan 28, 2021

I can claim this

@StellaAthena
Member Author

> I can claim this

Assigned!

@StellaAthena StellaAthena moved this from To do to In Progress in Implementing Evaluations Jan 29, 2021
@leogao2 leogao2 moved this from In Progress to To do, Evaluations to Implement in Implementing Evaluations Feb 12, 2021
StellaAthena pushed a commit that referenced this issue Apr 29, 2022
Implementing Evaluations automation moved this from To do, Evaluations to Implement to Done, evaluations Mar 25, 2023
Implementing Evaluations automation moved this from Done, evaluations to To do, Evaluations to Implement Mar 25, 2023
@cr458

cr458 commented Mar 27, 2023

would love to take this on if help on implementing the evaluation is still needed?

@StellaAthena
Member Author

> would love to take this on if help on implementing the evaluation is still needed?

Yes this would be quite helpful. Thanks!

@haileyschoelkopf
Contributor

@juletx
Contributor

juletx commented Apr 24, 2023

@haileyschoelkopf Some methods are not implemented, they raise NotImplementedError

@haileyschoelkopf
Contributor

Ah, you're right, sorry! I'm not sure why this was originally merged then. It's not in the task registry, though, so it should be alright to keep in the repo until the refactor is done, at which point we can decide what to do with it.

@memray

memray commented Jun 12, 2023

I wonder what the progress of NQ eval is and if any help is needed?

@StellaAthena
Member Author

@memray I am under the impression that it hasn't been implemented and help is needed.

@wwngh1233

+1

@Sea-Snell

+1

qmdnls pushed a commit to qmdnls/lm-evaluation-harness that referenced this issue Aug 17, 2023
Implementing Evaluations automation moved this from To do, Evaluations to Implement to Done, evaluations Aug 21, 2023
@haileyschoelkopf
Contributor

Closed by #789 which implements the NaturalQs dataset split used by Llama and (possibly, unconfirmed) used by PaLM and more!

LZY-the-boys pushed a commit to LZY-the-boys/lm-evaluation-harness-fast that referenced this issue Sep 12, 2023
NathanHB referenced this issue in huggingface/lm-evaluation-harness Jun 27, 2024
Add docs on Chat Template interface to `docs/model_guide.md`
lintangsutawika pushed a commit that referenced this issue Jul 8, 2024
remove added metrics -afrimgsm
Projects
No open projects
Implementing Evaluations
  
Done, evaluations

10 participants