
Central repository for results from running the evaluations #662

Open
c1505 opened this issue Jul 6, 2023 · 7 comments

Comments

@c1505

c1505 commented Jul 6, 2023

Motivation

I want to use MMLU results broken down by task to better understand the characteristics of LLMs. I am curious to see the differences between architectures and how performance on individual tasks changes as parameter count increases. I have found a lot of reporting of a single final MMLU score, but I can't find per-task data for most of these models. Some results are available here: https://github.com/EleutherAI/lm-evaluation-harness/tree/master/results

The results that do exist elsewhere often conflict with one another.

If the full result data were available, it would be easier to spot these discrepancies. Hopefully it would also encourage other groups running the evaluations to publish their full results as well.

Suggestions

  • Given that Hugging Face is already running this harness for their Open LLM Leaderboard, collaborate with them to get the full results of running the evaluations uploaded to some centralized place. A Hugging Face dataset or some other repository seems like it would be easiest (see the sketch after this list).
  • Suggest a repository and tag for others running the evaluation.
  • Report evaluation results for each code release
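
A minimal sketch of what the upload step could look like, assuming results are saved as one JSON file per model; the repo id my-org/lm-eval-results and the file layout are hypothetical, not an existing dataset:

```python
# Hypothetical sketch: push one model's harness output to a shared Hugging Face
# dataset repo. The repo id and file layout below are made up for illustration;
# assumes you are logged in (e.g. via `huggingface-cli login`).
from huggingface_hub import HfApi

api = HfApi()

# Create the shared dataset repo once (no-op if it already exists).
api.create_repo(repo_id="my-org/lm-eval-results", repo_type="dataset", exist_ok=True)

# Upload the full results file produced by the harness for one model.
api.upload_file(
    path_or_fileobj="results/llama-7b/mmlu.json",  # local harness output (hypothetical path)
    path_in_repo="llama-7b/mmlu.json",             # one folder per model in the shared repo
    repo_id="my-org/lm-eval-results",
    repo_type="dataset",
    commit_message="Add MMLU results for llama-7b",
)
```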
@c1505
Author

c1505 commented Jul 12, 2023

It does look like Hugging Face is working on making the results of running evaluations public, though I'm unsure how long that will take:
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/73#64a8483ab35f48e37df1a7c8

@haileyschoelkopf
Contributor

Thanks for raising this! This is something we might hope to do with the repo, but we don't have the manpower to maintain it ourselves.

I'll see about talking to folks from HF about working together to get this set up and whether that might be feasible! Likewise, if you have any ideas or are willing to help set up a system for this, we're definitely open to that as well.

@c1505
Author

c1505 commented Jul 24, 2023

Thank you for your attention to this issue. I understand the bandwidth concerns and am happy to help with all parts of encouraging more public evaluation results.

  • Hugging Face has recently made their full evaluation results public: https://huggingface.co/datasets/open-llm-leaderboard/results. I ran into errors trying to load it as a dataset, but was able to clone the repo and process the data directly (see the sketch at the end of this comment).
  • I created a sortable leaderboard that breaks down MMLU by task, which can be accessed here: https://huggingface.co/spaces/CoreyMorris/MMLU-by-task-Leaderboard. The site offers download options for the CSV data and includes scatterplots for enhanced data understanding.
  • I intend to update this leaderboard regularly, at least until Hugging Face develops one that provides a task-wise MMLU breakdown.
  • I’ll try to make the leaderboard more useful to folks with visualizations, filtering, and potentially interesting findings from analysis of the data. If you have any suggestions, please let me know.
  1. Strategies to Encourage Public Sharing of Results:
    • Make it easier to upload evaluation data
      • Provide easy-to-follow instructions
      • Integrate code into the evaluation harness to make it easy to upload results to the Hugging Face Hub or other platforms
    • Provide nudges:
      • Instructional materials can offer a significant nudge.
      • Collaboration with arXiv and/or Hugging Face to establish a dedicated section for evaluation data can create a "fill-in-the-blank" effect, nudging researchers to contribute.
  2. Repository Options:
    • Hugging Face Datasets:
      • Pros:
        • The Open LLM Leaderboard already publishes its full results there, and the community evaluating open-source LLMs is already on the platform
      • Cons:
        • Hugging Face datasets are primarily designed and optimized for training data.
    • Kaggle:
      • Pros:
        • Known for its capability to handle tabular data for data analysis.
        • Built-in visualization in the web interface.
      • Cons:
        • Likely not as popular with the community that trains, evaluates, and runs open-source LLMs
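
A rough sketch of how the cloned results repo could be processed into a per-task MMLU table, assuming the harness JSON layout used by the leaderboard at the time (a top-level "results" dict keyed by task names like hendrycksTest-*, each with an acc_norm metric); the exact paths and key names should be checked against the actual files:

```python
# Rough sketch: aggregate per-task MMLU accuracy from a cloned copy of
# https://huggingface.co/datasets/open-llm-leaderboard/results.
# Assumes each result file is harness JSON with a top-level "results" dict keyed
# by task name (e.g. "hendrycksTest-anatomy") containing an "acc_norm" value;
# these key names are an assumption and should be verified against the files.
import json
from pathlib import Path

import pandas as pd

rows = []
for path in Path("results").rglob("*.json"):
    data = json.loads(path.read_text())
    model = path.parent.name  # results are assumed to be grouped one folder per model
    for task, metrics in data.get("results", {}).items():
        if task.startswith("hendrycksTest-") and "acc_norm" in metrics:
            rows.append({"model": model, "task": task, "acc_norm": metrics["acc_norm"]})

# One row per model, one column per MMLU task.
df = pd.DataFrame(rows).pivot_table(index="model", columns="task", values="acc_norm")
df.to_csv("mmlu_by_task.csv")
```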

@c1505
Author

c1505 commented Aug 8, 2023

Preliminary data analysis is here: https://coreymorrisdata.medium.com/preliminary-analysis-of-mmlu-evaluation-data-insights-from-500-open-source-models-e67885aa364b

@haileyschoelkopf
Contributor

This may be of interest to you: we have a project we are hoping to push forward in which we want to measure how models' performance and predictions differ under small variations in task formatting or in how the answer choice is evaluated, including on MMLU. This would give us a sense of which benchmarks are relatively robust to such changes in eval decisions and which ones are brittle and lack construct validity.

Link to the thread on our discord where we're organizing this: https://discord.com/channels/729741769192767510/1120714014964588637
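
As a rough illustration of the kind of variation involved (not the project's actual setup), here are two plausible ways to format the same multiple-choice item; a robustness study would score the same model under each variant and compare the resulting accuracies:

```python
# Illustrative only: two formatting variants for the same multiple-choice item.
# The item is made up and the templates are not the harness's actual prompts.
QUESTION = "Which planet is closest to the Sun?"
CHOICES = ["Venus", "Mercury", "Earth", "Mars"]

def format_lettered(question, choices):
    """Variant A: lettered options; the answer is scored by comparing the letters A-D."""
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer:"

def format_cloze(question, choices):
    """Variant B: no listed options; each full choice string is scored by likelihood instead."""
    return f"Question: {question}\nAnswer:"

print(format_lettered(QUESTION, CHOICES))
print("---")
print(format_cloze(QUESTION, CHOICES))
```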

@c1505
Author

c1505 commented Aug 11, 2023

Thanks! I'll check it out :)

@c1505
Author

c1505 commented Sep 27, 2023

I found some formatting-related issues with the moral scenarios task, but other tasks also contain questions with formats similar to the ones shown to be problematic there: https://medium.com/@coreymorrisdata/is-it-really-about-morality-74fd6e512521
