
Add Selfcheckgpt evaluation to tasks #1080

Open · wants to merge 10 commits into base: main

Conversation

@erenup commented Dec 7, 2023

No description provided.

@accesslint (bot) left a comment

There are accessibility issues in these changes.

Review thread: lm_eval/tasks/selfcheckgpt/README.md (outdated, resolved)
@CLAassistant commented Dec 7, 2023

CLA assistant check
All committers have signed the CLA.

@accesslint (bot) left a comment

👏 You fixed the issue(s)! Great work.

@erenup (Author) commented Dec 7, 2023

I am trying to add a new task for LLM hallucination. This is very early code, and there is something I'd like to discuss with everyone:

  • How can we generate multiple generations for one prompt elegantly? This could be very useful for many LLM self-consistency evaluations.

I have implemented one possible solution, but I am not sure it is the best one (a rough sketch of the general idea follows below).

Thank you very much.
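For context, "multiple generations for one prompt" here means drawing N independent samples from the model for the same input, which self-consistency methods such as SelfCheckGPT then compare against each other. Below is a minimal standalone sketch using plain Hugging Face transformers, not this PR's harness code; the model name and decoding parameters are placeholders.

# Standalone illustration, not part of the harness: draw N sampled
# generations for a single prompt, the raw material for consistency checks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6b"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")

prompt = "Tell me about the Eiffel Tower."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=True with num_return_sequences=N yields N independent samples.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    max_new_tokens=128,
    num_return_sequences=5,
)
# Strip the prompt tokens so only the generated continuations remain.
samples = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)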

@erenup changed the title from "Selfcheckgpt" to "Add Selfcheckgpt evaluation to tasks" on Dec 7, 2023
@haileyschoelkopf (Contributor) left a comment

I've left a few comments as a review! I believe the functionality you want (N generations created per document) is already supported without any novel YAML config options. Please let me know if I am misunderstanding your intended functionality.

Review threads (outdated, resolved):
  • lm_eval/tasks/selfcheckgpt/selfcheckgpt.yaml (three threads)
  • lm_eval/tasks/selfcheckgpt/utils.py
  • lm_eval/api/task.py
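On the review point above about N generations per document: independent of how the harness wires this up internally, the bookkeeping amounts to expanding each document into N identical requests and regrouping the outputs by document before scoring. The sketch below is hypothetical; the helper names are made up for illustration and are not lm-eval APIs.

# Hypothetical sketch: fan each document out into N generation requests,
# then regroup the sampled outputs by document for consistency scoring.
from collections import defaultdict

def expand_requests(docs, n_samples=5):
    """Yield (doc_id, prompt) pairs, repeating each prompt n_samples times."""
    for doc_id, doc in enumerate(docs):
        for _ in range(n_samples):
            yield doc_id, doc["prompt"]

def regroup_by_doc(doc_ids, generations):
    """Collect sampled generations back under their originating document."""
    grouped = defaultdict(list)
    for doc_id, text in zip(doc_ids, generations):
        grouped[doc_id].append(text)
    return grouped

# grouped[doc_id] is then the list of sampled passages handed to a
# consistency scorer such as SelfCheckGPT's NLI variant.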
@erenup (Author) commented Dec 18, 2023

Hi @haileyschoelkopf @lintangsutawika, I have refactored the code so that it has its own task.py to handle its unique multiple-generation kwargs. To make it recognized by lm-eval, I imported selfcheckgpt in tasks/__init__.py, similar to squadv2. Thank you very much.

@StellaAthena (Member) commented Dec 18, 2023

@erenup Thank you for your contribution to our library. We will not be able to merge this PR until you do the following:

  1. Sign the CLA
  2. Change the readme to accord with our template. It currently looks like you copied the readme from the official code, which will confuse users as it refers to code not found in this repo and uses the term "we" to refer to people other than EleutherAI.
  3. Run some models through the code and compare results with the official implementation. Results on LLaMA, RWKV, Pythia, and T0 would cover all the major bases.

@erenup (Author) commented Dec 23, 2023

Hi @StellaAthena, thank you.

  • Sign the CLA.

    • done
  • Change the readme to accord with our template. It currently looks like you copied the readme from the official code, which will confuse users as it refers to code not found in this repo and uses the term "we" to refer to people other than EleutherAI.

    • done
  • Run some models through the code and compare results with the official implementation. Results on LLaMA, RWKV, Pythia, and T0 would cover all the major bases.

    • I have run some experiments, and the results should match the official repo, since the evaluation code in task.py follows the SelfCheckGPT README API and does not change anything else. The numbers below are the results (a sketch of the upstream scoring call follows them).
  • Some results:
    -- gpt-j-6b on SelfCheckNLI

export SELFCHECKGPTDEVICE=cuda
export SELFCHECKGPTTYPE=SelfCheckNLI
lm_eval --model hf \
    --model_args pretrained=${model_path_gpt-j-6b} \
    --tasks selfcheckgpt \
    --device cuda:0 \
    --batch_size 32

-- [results screenshot]

  • Llama-2-7b-chat-hf on SelfCheckNLI
export SELFCHECKGPTDEVICE=cuda
export SELFCHECKGPTTYPE=SelfCheckNLI
lm_eval --model hf \
    --model_args pretrained=${model_path_Llama-2-7b-chat-hf} \
    --tasks selfcheckgpt \
    --device cuda:0 \
    --batch_size 32

-- [results screenshot]
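For reference, the per-sentence scoring that the task delegates to the upstream selfcheckgpt package looks roughly like the following, based on that package's README; the sentences and sampled passages are toy placeholders, and the exact wiring in this PR's task.py may differ.

# Sketch of SelfCheckNLI scoring per the upstream selfcheckgpt README:
# each sentence of the main answer is checked against the N sampled passages,
# and higher scores indicate a likely hallucination (contradiction).
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

device = "cuda" if torch.cuda.is_available() else "cpu"
selfcheck_nli = SelfCheckNLI(device=device)

sentences = [
    "The Eiffel Tower is in Paris.",
    "It was completed in 1999.",
]
sampled_passages = [  # toy stand-ins for the extra generations of the same prompt
    "The Eiffel Tower stands in Paris and opened in 1889.",
    "Paris's Eiffel Tower was finished in 1889.",
    "The Eiffel Tower, located in Paris, opened to the public in 1889.",
]

sent_scores_nli = selfcheck_nli.predict(
    sentences=sentences,
    sampled_passages=sampled_passages,
)
print(sent_scores_nli)  # one score per sentence; the second should score high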

@StellaAthena requested reviews from lintangsutawika and haileyschoelkopf and removed the review request for StellaAthena on January 2, 2024 at 19:39.
@StellaAthena dismissed stale reviews from lintangsutawika and haileyschoelkopf on January 10, 2024 at 21:22:

It has been addressed and a new review is needed

@StellaAthena (Member) commented
@lintangsutawika @haileyschoelkopf this is now ready for your review again.
