Efficient multi-prompt evaluation of LLMs

Welcome to the PromptEval GitHub repository! Here you will find our implementation of PromptEval and the datasets introduced in:

Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).

Overview

Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs’ abilities and can affect the reproducibility of results on leaderboards. This repository contains our implementation of PromptEval, a method for estimating performance across a large set of prompts. It borrows strength across prompts and examples to produce accurate estimates under practical evaluation budgets.
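As a rough illustration of what borrowing strength across prompts and examples can look like, the sketch below fits a simple Rasch-style logistic model to a sparsely observed correctness matrix and uses it to fill in the unevaluated cells. This is a toy under stated assumptions, not the implementation shipped in this repository: the function names `fit_rasch` and `estimate_prompt_accuracy`, the hyperparameters, and the simulated data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_rasch(Y, seen, n_steps=500, lr=0.5, l2=1e-2):
    """Fit P(Y[i, j] = 1) = sigmoid(theta[i] - beta[j]) using observed cells only.

    Y    : (n_prompts, n_examples) array of 0/1 correctness values
    seen : boolean mask of the cells that were actually evaluated
    """
    Y = np.where(seen, Y, 0.0)                # ignore values in unseen cells
    theta = np.zeros(Y.shape[0])              # one parameter per prompt
    beta = np.zeros(Y.shape[1])               # one parameter per example
    row_n = np.maximum(seen.sum(axis=1), 1)   # observed cells per prompt
    col_n = np.maximum(seen.sum(axis=0), 1)   # observed cells per example
    for _ in range(n_steps):
        P = sigmoid(theta[:, None] - beta[None, :])
        resid = (P - Y) * seen                # log-loss gradient w.r.t. the logit
        # Gradient steps, normalized by the number of observed cells.
        theta -= lr * (resid.sum(axis=1) / row_n + l2 * theta)
        beta -= lr * (-resid.sum(axis=0) / col_n + l2 * beta)
    return theta, beta

def estimate_prompt_accuracy(Y, seen, theta, beta):
    """Per-prompt accuracy: observed cells as-is, unseen cells from the model."""
    P = sigmoid(theta[:, None] - beta[None, :])
    return np.where(seen, Y, P).mean(axis=1)

# Simulated data (hypothetical): 20 prompts, 200 examples, 10% of cells evaluated.
rng = np.random.default_rng(0)
true_theta = rng.normal(0.0, 1.0, size=20)
true_beta = rng.normal(0.0, 1.0, size=200)
P_true = sigmoid(true_theta[:, None] - true_beta[None, :])
Y_full = (rng.random(P_true.shape) < P_true).astype(float)
seen = rng.random(P_true.shape) < 0.1

theta, beta = fit_rasch(Y_full, seen)
est = estimate_prompt_accuracy(Y_full, seen, theta, beta)
print("mean absolute error:", np.abs(est - Y_full.mean(axis=1)).mean())
```

The resulting vector `est` has one estimated accuracy per prompt, so its quantiles give a picture of how performance varies across prompt templates even though only a fraction of the prompt-example pairs were evaluated.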

Quick start

Please check our demo to see how to use PromptEval on your own data.

Repository Structure

  • data/: Contains the evaluation data used in the experiments.
  • prompteval/: Source code for the PromptEval method and utilities.
  • notebooks/: Jupyter notebooks used to create plots for the PromptEval paper.
  • results/: Results from the experiments conducted in the paper.
  • mmlu_data/: Contains code for gathering evaluation data.

Installation

To use the code in this repository, clone the repo and install the required dependencies:

git clone https://github.com/felipemaiapolo/prompteval.git
cd prompteval
pip install -e .

Reproducing the results of the paper

To reproduce the results in our paper, follow these steps after cloning the repo and installing the dependencies:

  1. Download the BBH and LMentry data, produced by the authors of "State of What Art? A Call for Multi-Prompt LLM Evaluation", from here. Place the unzipped folder "raw open-source model responses with gold and auto validation values" inside the data directory;
  2. Process data by running create_data.py;
  3. Run main experiments by running dist_evaluation.py. Example: python dist_evaluation.py --bench 'BBH' --random_seeds 5;
  4. Run best prompt identification by running bai_evaluation.py. Example: python bai_evaluation.py --bench 'BBH' --random_seeds 5;
  5. Create plots using the notebooks in the notebooks directory.
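Steps 2 to 4 above can be chained in a small driver script. The sketch below is a convenience wrapper, not part of the repository: the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes the raw data from step 1 is already in place and that the script is launched from the directory containing create_data.py, dist_evaluation.py, and bai_evaluation.py. By default it only prints the commands; pass `dry_run=False` to execute them.

```python
import subprocess
import sys

def build_commands(bench="BBH", random_seeds=5):
    """Return the commands for steps 2-4, using the flags shown in the README."""
    seeds = str(random_seeds)
    return [
        [sys.executable, "create_data.py"],
        [sys.executable, "dist_evaluation.py", "--bench", bench, "--random_seeds", seeds],
        [sys.executable, "bai_evaluation.py", "--bench", bench, "--random_seeds", seeds],
    ]

def run_pipeline(commands, dry_run=True):
    """Print each command and, unless dry_run is set, run it in order."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)

commands = build_commands(bench="BBH", random_seeds=5)
run_pipeline(commands, dry_run=True)
```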

Fine-tuning embeddings

To fine-tune BERT representations run the following:

python ./prompteval/ft_representations.py --model_name "bert-base-uncased" \
                             --lr 2e-05 \
                             --weight_decay 1e-06 \
                             --gamma .99995 \
                             --bs 96 \
                             --n_epochs 5 \
                             --warmup_steps 200 \
                             --bench "BBH" 

Note that this requires the file ./data/Ys.pickle, which is created by the create_data.py script, to contain correctness data for the respective benchmark. Add --push_to_hub to automatically push the resulting model to your namespace on the Hugging Face Hub (remember to run huggingface-cli login before training).
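Before launching a fine-tuning run, it can help to confirm that the correctness file is present and loadable. The following is a minimal sketch; the helper `load_correctness` is hypothetical, and the internal structure of the pickled object is not documented here, so the code only reports its type.

```python
import pickle
from pathlib import Path

def load_correctness(path="./data/Ys.pickle"):
    """Load the pickled correctness data, failing early if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; run create_data.py first.")
    # Only unpickle files you created yourself or otherwise trust.
    with path.open("rb") as f:
        return pickle.load(f)

try:
    Ys = load_correctness()
    print("Loaded object of type:", type(Ys).__name__)
except FileNotFoundError as err:
    print(err)
```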

MMLU Data

We make the MMLU data we collected available on Hugging Face. The data includes evaluations of 15 different SOTA LLMs on 100 different prompt templates.

Citing

@article{polo2024efficient,
title={Efficient multi-prompt evaluation of LLMs},
author={Polo, Felipe Maia and Xu, Ronald and Weber, Lucas and Silva, M{\'\i}rian and Bhardwaj, Onkar and Choshen, Leshem and de Oliveira, Allysson Flavio Melo and Sun, Yuekai and Yurochkin, Mikhail},
journal={arXiv preprint arXiv:2405.17202},
year={2024}
}
