FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models (ACL 2024)

We introduce FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for systematically and precisely evaluating the instruction-following capability of LLMs.

FollowBench comprehensively includes five different types (i.e., Content, Situation, Style, Format, and Example) of fine-grained constraints.
To enable a precise constraint following estimation on diverse difficulties, we introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
To evaluate whether LLMs' outputs have satisfied every individual constraint, we propose to prompt strong LLMs with constraint-evolution paths to handle challenging open-ended instructions.
By evaluating 13 closed-source and open-source popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
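
As a rough illustration of the Multi-level mechanism, the sketch below builds one instruction per level by adding a single constraint at a time. The `LeveledInstruction` class, the `build_levels` helper, and the plain string concatenation are assumptions made for illustration only; the benchmark's actual instructions are stored in data/ and are not necessarily formed by simple concatenation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeveledInstruction:
    """Hypothetical container for one level of a constraint-evolution path."""
    level: int                      # number of constraints added so far
    instruction: str                # instruction text at this level
    constraints: tuple[str, ...]    # constraints in force at this level


def build_levels(initial: str, constraints: list[str]) -> list[LeveledInstruction]:
    """Return level 0 (the initial instruction) through level len(constraints)."""
    levels = [LeveledInstruction(0, initial, ())]
    for level in range(1, len(constraints) + 1):
        active = tuple(constraints[:level])  # exactly one more constraint than the previous level
        # Assumption: constraints are appended as extra sentences.
        text = initial + " " + " ".join(active)
        levels.append(LeveledInstruction(level, text, active))
    return levels


if __name__ == "__main__":
    for item in build_levels("Write a poem.", ["Use rhyme.", "Mention the sea."]):
        print(item.level, item.instruction)
```

Each level keeps every constraint from the level below it, which is what allows a per-level comparison of how far a model can follow an increasingly constrained instruction.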

🔥 Updates

2024/05/16: We are delighted that FollowBench has been accepted to the ACL 2024 main conference!
2024/01/11: We have uploaded the English and Chinese versions of FollowBench to Hugging Face.
2023/12/20: We evaluated Qwen-Chat-72B/14B/7B on FollowBench; see the Leaderboard.
2023/12/15: We released a Chinese version of FollowBench; see data_zh/.
2023/11/14: We released the second version of our paper. Check it out!
2023/11/10: We released the data and code of FollowBench.
2023/10/31: We released the first version of our paper. Check it out!

🖥️ Leaderboard

Metrics

Hard Satisfaction Rate (HSR): the average rate at which all constraints of individual instructions are fully satisfied
Soft Satisfaction Rate (SSR): the average satisfaction rate of individual constraints across all instructions
Consistent Satisfaction Levels (CSL): how many consecutive levels a model can satisfy, beginning from level 1
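
The three metrics can be sketched in a few lines of Python. This is a minimal illustration under an assumed data layout, not the repository's evaluation code: `results[group_id][level]` is taken to be a list of booleans, one per constraint of the instruction at that level, and the names `Results`, `hsr`, `ssr`, and `csl` are hypothetical.

```python
from statistics import mean

# Assumed layout: results[group_id][level] is a list of booleans,
# one per constraint of the instruction at that level in that group.
Results = dict[str, dict[int, list[bool]]]


def hsr(results: Results, level: int) -> float:
    """Fraction of instructions at `level` whose constraints are all satisfied."""
    return mean(float(all(group[level])) for group in results.values())


def ssr(results: Results, level: int) -> float:
    """Mean per-constraint satisfaction rate over the instructions at `level`."""
    return mean(mean(map(float, group[level])) for group in results.values())


def csl(results: Results) -> float:
    """Mean number of consecutive fully satisfied levels, counted from level 1."""
    def run_length(group: dict[int, list[bool]]) -> int:
        level = 1
        while level in group and all(group[level]):
            level += 1
        return level - 1
    return mean(run_length(group) for group in results.values())


if __name__ == "__main__":
    demo = {
        "a": {1: [True], 2: [True, True], 3: [True, False, True]},
        "b": {1: [True], 2: [True, False], 3: [True, True, True]},
    }
    print(hsr(demo, 2))  # 0.5
    print(ssr(demo, 2))  # 0.75
    print(csl(demo))     # 1.5
```

In the demo, group "b" satisfies every constraint at level 3 but still scores a CSL of 1, because the run of satisfied levels is broken at level 2.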

Level-categorized Results

English

Chinese

Constraint-categorized Results

English

Chinese

📄 Data of FollowBench

The data of FollowBench can be found in data/.

We also provide a Chinese version of FollowBench in data_zh/.

⚙️ How to Evaluate on FollowBench

Install Dependencies

conda create -n followbench python=3.10
conda activate followbench
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt

Model Inference

cd FollowBench/
python code/model_inference.py --model_path <model_name_or_path>

LLM-based Evaluation

cd FollowBench/
python code/llm_eval.py --model_path <model_name_or_path> --api_key <your_own_gpt4_api_key>

Merge Evaluation and Save Results

Next, we can merge the rule-based evaluation results and LLM-based evaluation results using the following script:

cd FollowBench/
python code/eval.py --model_paths <a_list_of_evaluated_models>

The final results will be saved in the folder named evaluation_result.
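
When several models are evaluated, the three steps above can be chained from a small driver script. The sketch below only assembles the command lines shown above; the `followbench_commands` helper is hypothetical, and passing the models to `--model_paths` as space-separated values is an assumption about the argument format.

```python
import shlex


def followbench_commands(model_paths: list[str], api_key: str) -> list[list[str]]:
    """Build argv lists for inference, LLM-based evaluation, and the final merge."""
    commands = []
    for path in model_paths:
        commands.append(["python", "code/model_inference.py", "--model_path", path])
        commands.append(
            ["python", "code/llm_eval.py", "--model_path", path, "--api_key", api_key]
        )
    # Assumption: --model_paths accepts the evaluated models as space-separated values.
    commands.append(["python", "code/eval.py", "--model_paths", *model_paths])
    return commands


if __name__ == "__main__":
    for argv in followbench_commands(["<model_name_or_path>"], "<your_own_gpt4_api_key>"):
        # To execute instead of print: subprocess.run(argv, check=True, cwd="FollowBench")
        print(shlex.join(argv))
```

Printing the commands first makes it easy to check them before any GPT-4 API calls are made.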

📝 Citation

Please cite our paper if you use the data or code in this repo.

@misc{jiang2023followbench,
      title={FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models}, 
      author={Yuxin Jiang and Yufei Wang and Xingshan Zeng and Wanjun Zhong and Liangyou Li and Fei Mi and Lifeng Shang and Xin Jiang and Qun Liu and Wei Wang},
      year={2023},
      eprint={2310.20410},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
