Merge pull request #33 from prometheus-eval/feat/bgb
New Feature: Support BiGGen-Bench Evaluation
scottsuk0306 committed Jun 5, 2024
2 parents 2869d38 + 9549363 commit a760f9d
Showing 78 changed files with 38,729 additions and 2,518 deletions.
Empty file added BiGGen-Bench/.env.template
Empty file.
7 changes: 7 additions & 0 deletions BiGGen-Bench/.gitignore
@@ -0,0 +1,7 @@
.env
*_evals.json
*_responses.json
!sample_evals.json
!sample_responses.json
package_init.sh
init.sh
181 changes: 181 additions & 0 deletions BiGGen-Bench/README.md
@@ -0,0 +1,181 @@
<p align="center">
<img src="https://raw.githubusercontent.com/prometheus-eval/prometheus-eval/feat/bgb/BiGGen-Bench/assets/logo.png" alt="BiGGen-Bench-Logo" style="width: 25%; display: block; margin: auto;">
</p>

<h1 align="center"> BiGGen-Bench </h1>

<a href="https://huggingface.co/datasets/prometheus-eval/BiGGen-Bench"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-ffd21e" alt="Hugging Face Datasets"></a>
<a href="https://huggingface.co/prometheus-eval/prometheus-bgb-8x7b-v2.0"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-ff9d00" alt="Hugging Face Model"></a>
<a href="https://huggingface.co/spaces/prometheus-eval/BiGGen-Bench-Leaderboard"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-orange" alt="Hugging Face Model"></a>


BiGGen-Bench is a dedicated benchmarking platform designed to evaluate the nuanced capabilities of Large Language Models across a variety of complex and diverse tasks.

## 🚀 Features

- **Evaluation Scope**: Covers nine key capabilities of LLMs across 77 tasks, with 765 unique instances tailored to test specific aspects of model performance.
- **Scoring System**: Utilizes a detailed scoring rubric from 1 to 5, reflecting a range of outcomes based on instance-specific criteria closely aligned with the nuanced requirements of each task.
- **Transparency and Openness**: All code, data, and detailed evaluation results are publicly available to foster transparency and enable community-driven enhancements and verifications.


## 📋 Prerequisites

Before you dive in, make sure you have the following:

- **Python 3.10+**: The scripts are tested with Python 3.10 and later versions. You can download Python from [here](https://www.python.org/downloads/).
- **Pip**: Python's package installer. It usually comes with Python; make sure it's updated to the latest version using `python -m pip install --upgrade pip`.
- **Virtual Environment** (optional but recommended): a minimal setup example follows below.
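
For example, a minimal virtual-environment setup with Python's built-in `venv` module (any environment manager works; the commands below assume a Unix-like shell):

```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Keep pip up to date inside the environment
python -m pip install --upgrade pip
```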

## 🚀 Installation

First, clone the repository and move to the project directory.

```bash
git clone https://github.com/prometheus-eval/prometheus-eval.git
cd prometheus-eval
cd BiGGen-Bench
```

Install the necessary Python packages:

```bash
pip install -r requirements.txt
```

This will install all required libraries, including `prometheus-eval`, `vllm`, `huggingface_hub`, `pandas`, `transformers`, and others that are crucial for running the scripts.

## 📁 Project Structure

The toolkit contains several scripts categorized based on their functionality:

- **Inference Scripts**:
  - `run_api_inference.py`: Runs inference on API-based models using `AsyncLiteLLM` (built on `litellm`).
  - `run_base_inference.py`: Executes inference with base (pre-trained, non-chat) models and handles the prompt formatting they require.
  - `run_chat_inference.py`: Generates responses with chat models, formatting prompts with `AutoTokenizer`.

- **Evaluation Scripts**:
  - `run_response_eval.py`: Evaluates the responses generated by the inference scripts with a judge model, using each instance's score rubric, and produces feedback and scores.
  - `make_table.py`: Generates a summary table from the evaluation results, presenting average scores per capability.

Each script is equipped with command-line interface (CLI) support for easy configuration and execution.

## 🖥️ Usage

Here's how to run the scripts:

### **Running Inference**:
**For API model inference**:
```bash
python run_api_inference.py --model_name "your-model-name" --output_file_path "./outputs/api_response.json"
```
- With the help of `litellm`, you can use APIs from various providers. We mainly used the OpenAI API and OpenRouter. Refer to [openrouter/models](https://openrouter.ai/models) for information on supported models.

- Note that your API key must be available in a separate `.env` file for the inference; a sketch is shown below.
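
An empty `.env.template` is included in this directory as a placeholder; your own `.env` might look roughly like the following, where the variable names are illustrative and depend on the provider you route through with `litellm`:

```bash
# Illustrative .env contents -- variable names depend on your provider
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...
```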

**For base model inference**:


```bash
python run_base_inference.py --model_name "your-model-name" --output_file_path "./outputs/base_response.json"
```
- The model must be available on the Hugging Face Hub and supported by `vllm`.

**For chat model inference**:

```bash
python run_chat_inference.py --model_name "your-model-name" --output_file_path "./outputs/chat_response.json"
```
- The model must be available on the Hugging Face Hub and supported by `vllm`.


- If you already have your own inference script, make sure your response file follows the format of [`sample_responses.json`](/BiGGen-Bench/sample_responses.json):

```json
{
    "planning_travel_plan_0": {
        "id": "planning_travel_plan_0",
        "capability": "planning",
        "task": "travel_plan",
        "instance_idx": 0,
        "system_prompt": "You are a travel agent that can design travel plans.",
        "input": "Design a travel plan for a tourist traveling to the given destination. The tourist has a list of requirements and you should design your plan such that it satisfies all of these requirements.\n\nDestination: Paris\n\nRequirements:\n- Total Duration: 2 days and 1 night\n- Transportation: Walk\n- Must Have: Eiffel Tower, Louvre Museum, Escargot\n- Optional: Croissant, Onion Soup, Notre Dame Cathedral",
        "reference_answer": "Day 1 - Morning:\n- Visit the Louvre Museum (3 hours)\n- Walk to Caf\u00e9 de Flore (15 minutes)\nDay 1 - Lunch:\n- Caf\u00e9 de Flore - Enjoy croissants and French cuisine (1 hour)\nDay 1 - Afternoon:\n- Walk to Notre Dame Cathedral (20 minutes)\n- Explore Notre Dame (1.5 hours)\n- Walk to Eiffel Tower (1 hour)\nDay 1 - Evening:\n- Visit Eiffel Tower (2 hours)\n- Walk to Le Petit Cler (15 minutes)\nDay 1 - Dinner:\n- Le Petit Cler - Try Escargot and French Onion Soup (1.5 hours)\n\nDay 2 - Morning:\n- Leisure time around Eiffel Tower area (1 hour)\n- Walk to Sainte-Chapelle (1 hour)\nDay 2 - Midday:\n- Visit Sainte-Chapelle (1 hour)\n- Walk to nearby caf\u00e9 (20 minutes)\nDay 2 - Lunch:\n- Les Deux Magots - Experience Parisian caf\u00e9 culture (1 hour)\nDay 2 - Afternoon:\n- Stroll along the Seine River, enjoy local street food and shops (2 hours)",
        "score_rubric": {
            "criteria": "Does the response effectively plan a tourist's 2-day trip to Paris, incorporating the must-have experiences within the given constraints (walking as transportation, including Eiffel Tower, Louvre Museum, and Escargot), while also considering the optional items?",
            "score1_description": "The response outlines a basic itinerary but misses more than two must-have experiences or does not consider the walking-only transportation constraint, showing a lack of detailed planning.",
            "score2_description": "The response includes all must-have experiences and adheres to the walking-only transportation constraint, but the plan is unrealistic or poorly structured (e.g., significant backtracking or timing issues), showing some effort in planning but lacking in feasibility or efficiency.",
            "score3_description": "The response successfully incorporates all must-have experiences and the walking-only transportation constraint, with a realistic and somewhat efficient itinerary. However, it misses the opportunity to include any optional items, showing good planning but limited optimization for the tourist's experience.",
            "score4_description": "The response includes all must-have experiences and adheres to the walking-only transportation constraint with a realistic and efficient itinerary. It includes at least one optional item, showing very good planning that enhances the tourist's experience.",
            "score5_description": "The response presents a well-thought-out, efficient, and realistic itinerary that includes all must-have experiences within the walking-only constraint and incorporates all optional items, demonstrating excellent planning and optimization for an enriching tourist experience."
        },
        "response": "Hello World!",
        "response_model_name": "sample_model"
    },
    ...
}
```

### **Evaluating Responses**:
- To evaluate responses and generate feedback:

```bash
python run_response_eval.py --model_name "prometheus-eval/prometheus-7b-v2.0" --input_file_path "./outputs/api_response.json" --output_file_path "./feedback/evaluated.json"
```

- For evaluation with `prometheus-eval/prometheus-8x7b-v2.0` or `prometheus-eval/prometheus-bgb-8x7b-v2.0`, you will have to adjust the `tensor_parallel_size` of `VLLM` to avoid CUDA OOM errors.
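
For context, evaluation follows the absolute-grading flow of the `prometheus-eval` library. Below is a minimal sketch of that flow for a single instance; the interfaces shown are assumed from the library's documentation, and `run_response_eval.py` batches this over the whole response file and may differ in its details:

```python
# A minimal, single-instance sketch of the absolute-grading flow used for evaluation.
# The prometheus-eval interfaces below (VLLM, PrometheusEval, ABSOLUTE_PROMPT,
# SCORE_RUBRIC_TEMPLATE, single_absolute_grade) are assumed from the library's README.
import json

from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE
from prometheus_eval.vllm import VLLM

judge = PrometheusEval(
    model=VLLM(model="prometheus-eval/prometheus-7b-v2.0"),
    absolute_grade_template=ABSOLUTE_PROMPT,
)

with open("./outputs/api_response.json") as f:  # illustrative path
    responses = json.load(f)

instance = responses["planning_travel_plan_0"]
# Assumes the rubric keys (criteria, score1_description, ..., score5_description)
# match the placeholders in SCORE_RUBRIC_TEMPLATE.
rubric = SCORE_RUBRIC_TEMPLATE.format(**instance["score_rubric"])

feedback, score = judge.single_absolute_grade(
    instruction=instance["input"],
    response=instance["response"],
    rubric=rubric,
    reference_answer=instance["reference_answer"],
)
print(score, feedback)
```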



### **Generating Reports**:
- To create a performance report from the evaluated feedback:
```bash
python make_table.py --feedback_file_path "./feedback/evaluated.json"
```
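
Judging from the fields consumed by the report script in this toolkit, each entry in the evaluated feedback file is expected to contain at least a `capability`, a numeric `score`, and the `response_model_name` / `eval_model_name` used. A rough sketch of one entry (values are placeholders):

```json
{
    "planning_travel_plan_0": {
        "capability": "planning",
        "score": 4,
        "response_model_name": "sample_model",
        "eval_model_name": "prometheus-eval/prometheus-7b-v2.0"
    }
}
```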


## 🛠️ Custom Run

The scripts within this toolkit are implemented to be as neat and comprehensible as possible, making it easy for you to modify them as needed. Whether you want to adjust response generation parameters, focus on specific capabilities, or tune the model configurations to better suit your GPU environment, these scripts are built to accommodate such customizations.

### 🎛️ Adjusting Model Parameters

For instance, if you're using the `VLLM` model class and wish to optimize its parameters for your specific GPU setup, you can adjust settings such as `tensor_parallel_size`, `gpu_memory_utilization`, `max_model_len`, or `quantization`. This customization allows you to make the most efficient use of your hardware resources. Here's a snippet to guide you on adjusting the VLLM parameters in any script:

```python
# Example of customizing VLLM parameters
if model_name.endswith("AWQ"):
    model = VLLM(model_name, tensor_parallel_size=4, quantization="AWQ")  # Adjust `tensor_parallel_size` as needed
elif model_name.endswith("GPTQ"):
    model = VLLM(model_name, tensor_parallel_size=4, gpu_memory_utilization=0.9, quantization="GPTQ")  # Adjust for your GPU capacity
else:
    model = VLLM(model_name, tensor_parallel_size=4, max_model_len=8192)  # Default setting
```

### 🎯 Focusing on Specific Capabilities

If you are interested in testing only a particular capability, such as "reasoning" or "multilingual", you can modify the script to filter out other capabilities. This can be done by inserting a simple conditional check to skip unwanted capabilities during the loading of your dataset:

```python
# Example of filtering for specific capabilities
for row in dataset.iterrows():
    record = row[1].to_dict()
    if record["capability"] != "desired_capability":
        continue  # Skip processing this record
    # Your processing logic here
```

### 🔧 Modifying Response Generation Parameters

Adjusting response generation parameters like `temperature`, `top_p`, or `max_tokens` is straightforward. You can tweak these parameters directly in the `params` dictionary used in the completion methods:

```python
# Example of customizing response generation parameters
params = {
"max_tokens": 512,
"temperature": 0.5,
"top_p": 0.85,
"use_tqdm": True,
}
```

By tailoring these scripts to your needs, you can maximize the effectiveness of your evaluations and ensure the toolkit performs optimally within your computational environment. Feel free to dive into the code and make it your own!
Binary file added BiGGen-Bench/assets/logo.png
77 changes: 77 additions & 0 deletions BiGGen-Bench/make_report.py
@@ -0,0 +1,77 @@
import argparse
import json

from rich import box
from rich.console import Console
from rich.table import Table


def read_json(file_path):
    with open(file_path, "r") as file:
        data = json.load(file)
    return data


def main(args):
    console = Console()

    feedback_file_path = args.feedback_file_path
    feedback_data = read_json(feedback_file_path)
    feedback_data_list = list(feedback_data.values())

    scores = {
        "grounding": [],
        "instruction_following": [],
        "planning": [],
        "reasoning": [],
        "refinement": [],
        "safety": [],
        "theory_of_mind": [],
        "tool_usage": [],
        "multilingual": [],
    }

    response_model = feedback_data_list[0]["response_model_name"]
    eval_model = feedback_data_list[0]["eval_model_name"]

    for _, instance in feedback_data.items():
        capability = instance["capability"]
        scores[capability].append(instance["score"])

    # Initialize table for output
    table = Table(
        title=f"Performance Report for {response_model} graded by {eval_model}",
        box=box.ROUNDED,
    )
    table.add_column("Capability", justify="left", style="cyan", no_wrap=True)
    table.add_column("Average Score", justify="right", style="green")

    for capability, score_list in scores.items():
        average_score = sum(score_list) / len(score_list) if score_list else None
        if average_score is not None:
            table.add_row(capability, f"{average_score:.3f}")
        else:
            table.add_row(capability, "N/A")

    all_scores = [
        sum(score_list) / len(score_list)
        for score_list in scores.values()
        if score_list
    ]
    overall_average = sum(all_scores) / len(all_scores) if all_scores else 0
    table.add_row("Overall", f"{overall_average:.3f}", style="bold red")

    console.print(table)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Score the model")
    parser.add_argument(
        "--feedback_file_path",
        type=str,
        required=True,
        help="Path to the feedback file",
    )

    args = parser.parse_args()
    main(args)