Merge pull request #33 from prometheus-eval/feat/bgb
New Feature: Support BiGGen-Bench Evaluation
scottsuk0306 committed Jun 5, 2024
2 parents 2869d38 + 9549363 commit a760f9d
Showing 78 changed files with 38,729 additions and 2,518 deletions.
Empty file added BiGGen-Bench/.env.template
Empty file.
7 changes: 7 additions & 0 deletions BiGGen-Bench/.gitignore
@@ -0,0 +1,7 @@
.env
*_evals.json
*_responses.json
!sample_evals.json
!sample_responses.json
package_init.sh
init.sh
181 changes: 181 additions & 0 deletions BiGGen-Bench/README.md
@@ -0,0 +1,181 @@
<p align="center">
<img src="https://raw.githubusercontent.com/prometheus-eval/prometheus-eval/feat/bgb/BiGGen-Bench/assets/logo.png" alt="BiGGen-Bench-Logo" style="width: 25%; display: block; margin: auto;">
</p>

<h1 align="center"> BiGGen-Bench </h1>

<a href="https://huggingface.co/datasets/prometheus-eval/BiGGen-Bench"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-ffd21e" alt="Hugging Face Datasets"></a>
<a href="https://huggingface.co/prometheus-eval/prometheus-bgb-8x7b-v2.0"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-ff9d00" alt="Hugging Face Model"></a>
<a href="https://huggingface.co/spaces/prometheus-eval/BiGGen-Bench-Leaderboard"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-orange" alt="Hugging Face Model"></a>


BiGGen-Bench is a dedicated benchmarking platform designed to evaluate the nuanced capabilities of Large Language Models across a variety of complex and diverse tasks.

## 🚀 Features

- **Evaluation Scope**: Covers nine key capabilities of LLMs across 77 tasks, with 765 unique instances tailored to test specific aspects of model performance.
- **Scoring System**: Utilizes a detailed scoring rubric from 1 to 5, reflecting a range of outcomes based on instance-specific criteria closely aligned with the nuanced requirements of each task.
- **Transparency and Openness**: All code, data, and detailed evaluation results are publicly available to foster transparency and enable community-driven enhancements and verifications.


## 📋 Prerequisites

Before you dive in, make sure you have the following:

- **Python 3.10+**: The scripts are tested with Python 3.10 and later versions. You can download Python from [here](https://www.python.org/downloads/).
- **Pip**: Python's package installer. It usually comes with Python; make sure it's updated to the latest version using `python -m pip install --upgrade pip`.
- **Virtual Environment** (optional but recommended): a minimal setup example follows below.
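
For example, a minimal virtual-environment setup with Python's built-in `venv` module (any environment manager works; the commands below assume a Unix-like shell):

```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Keep pip up to date inside the environment
python -m pip install --upgrade pip
```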

## 🚀 Installation

First, clone the repository and move to the project directory.

```bash
git clone https://github.com/prometheus-eval/prometheus-eval.git
cd prometheus-eval
cd BiGGen-Bench
```

Install the necessary Python packages:

```bash
pip install -r requirements.txt
```

This will install all required libraries, including `prometheus-eval`, `vllm`, `huggingface_hub`, `pandas`, `transformers`, and others that are crucial for running the scripts.

## 📁 Project Structure

The toolkit contains several scripts categorized based on their functionality:

- **Inference Scripts**:
  - `run_api_inference.py`: Runs inference on API-based models using `AsyncLiteLLM` (built on `litellm`).
  - `run_base_inference.py`: Executes inference with base (pre-trained, non-chat) models and handles the prompt formatting they require.
  - `run_chat_inference.py`: Generates responses with chat models, formatting prompts with `AutoTokenizer`.

- **Evaluation Scripts**:
  - `run_response_eval.py`: Evaluates the responses generated by the inference scripts with a judge model, using each instance's score rubric, and produces feedback and scores.
  - `make_table.py`: Generates a summary table from the evaluation results, presenting average scores per capability.

Each script is equipped with command-line interface (CLI) support for easy configuration and execution.

## 🖥️ Usage

Here's how to run the scripts:

### **Running Inference**:
**For API model inference**:
```bash
python run_api_inference.py --model_name "your-model-name" --output_file_path "./outputs/api_response.json"
```
- With the help of `litellm`, you can use APIs from various providers. We mainly used the OpenAI API and OpenRouter. Refer to [openrouter/models](https://openrouter.ai/models) for information on supported models.

- Note that your API key must be available in a separate `.env` file for the inference; a sketch is shown below.
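
An empty `.env.template` is included in this directory as a placeholder; your own `.env` might look roughly like the following, where the variable names are illustrative and depend on the provider you route through with `litellm`:

```bash
# Illustrative .env contents -- variable names depend on your provider
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...
```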

**For base model inference**:


```bash
python run_base_inference.py --model_name "your-model-name" --output_file_path "./outputs/base_response.json"
```
- The model must be available on the Hugging Face Hub and supported by `vllm`.

**For chat model inference**:

```bash
python run_chat_inference.py --model_name "your-model-name" --output_file_path "./outputs/chat_response.json"
```
- The model must be available on the Hugging Face Hub and supported by `vllm`.


- If you already have your own inference script, make sure your response file follows the format of [`sample_responses.json`](/BiGGen-Bench/sample_responses.json):

```json
{
    "planning_travel_plan_0": {
        "id": "planning_travel_plan_0",
        "capability": "planning",
        "task": "travel_plan",
        "instance_idx": 0,
        "system_prompt": "You are a travel agent that can design travel plans.",
        "input": "Design a travel plan for a tourist traveling to the given destination. The tourist has a list of requirements and you should design your plan such that it satisfies all of these requirements.\n\nDestination: Paris\n\nRequirements:\n- Total Duration: 2 days and 1 night\n- Transportation: Walk\n- Must Have: Eiffel Tower, Louvre Museum, Escargot\n- Optional: Croissant, Onion Soup, Notre Dame Cathedral",
        "reference_answer": "Day 1 - Morning:\n- Visit the Louvre Museum (3 hours)\n- Walk to Caf\u00e9 de Flore (15 minutes)\nDay 1 - Lunch:\n- Caf\u00e9 de Flore - Enjoy croissants and French cuisine (1 hour)\nDay 1 - Afternoon:\n- Walk to Notre Dame Cathedral (20 minutes)\n- Explore Notre Dame (1.5 hours)\n- Walk to Eiffel Tower (1 hour)\nDay 1 - Evening:\n- Visit Eiffel Tower (2 hours)\n- Walk to Le Petit Cler (15 minutes)\nDay 1 - Dinner:\n- Le Petit Cler - Try Escargot and French Onion Soup (1.5 hours)\n\nDay 2 - Morning:\n- Leisure time around Eiffel Tower area (1 hour)\n- Walk to Sainte-Chapelle (1 hour)\nDay 2 - Midday:\n- Visit Sainte-Chapelle (1 hour)\n- Walk to nearby caf\u00e9 (20 minutes)\nDay 2 - Lunch:\n- Les Deux Magots - Experience Parisian caf\u00e9 culture (1 hour)\nDay 2 - Afternoon:\n- Stroll along the Seine River, enjoy local street food and shops (2 hours)",
        "score_rubric": {
            "criteria": "Does the response effectively plan a tourist's 2-day trip to Paris, incorporating the must-have experiences within the given constraints (walking as transportation, including Eiffel Tower, Louvre Museum, and Escargot), while also considering the optional items?",
            "score1_description": "The response outlines a basic itinerary but misses more than two must-have experiences or does not consider the walking-only transportation constraint, showing a lack of detailed planning.",
            "score2_description": "The response includes all must-have experiences and adheres to the walking-only transportation constraint, but the plan is unrealistic or poorly structured (e.g., significant backtracking or timing issues), showing some effort in planning but lacking in feasibility or efficiency.",
            "score3_description": "The response successfully incorporates all must-have experiences and the walking-only transportation constraint, with a realistic and somewhat efficient itinerary. However, it misses the opportunity to include any optional items, showing good planning but limited optimization for the tourist's experience.",
            "score4_description": "The response includes all must-have experiences and adheres to the walking-only transportation constraint with a realistic and efficient itinerary. It includes at least one optional item, showing very good planning that enhances the tourist's experience.",
            "score5_description": "The response presents a well-thought-out, efficient, and realistic itinerary that includes all must-have experiences within the walking-only constraint and incorporates all optional items, demonstrating excellent planning and optimization for an enriching tourist experience."
        },
        "response": "Hello World!",
        "response_model_name": "sample_model"
    },
    ...
}
```

### **Evaluating Responses**:
- To evaluate responses and generate feedback:

```bash
python run_response_eval.py --model_name "prometheus-eval/prometheus-7b-v2.0" --input_file_path "./outputs/api_response.json" --output_file_path "./feedback/evaluated.json"
```

- For evaluation with `prometheus-eval/prometheus-8x7b-v2.0` or `prometheus-eval/prometheus-bgb-8x7b-v2.0`, you will have to adjust the `tensor_parallel_size` of `VLLM` to avoid CUDA OOM errors.
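
For context, evaluation follows the absolute-grading flow of the `prometheus-eval` library. Below is a minimal sketch of that flow for a single instance; the interfaces shown are assumed from the library's documentation, and `run_response_eval.py` batches this over the whole response file and may differ in its details:

```python
# A minimal, single-instance sketch of the absolute-grading flow used for evaluation.
# The prometheus-eval interfaces below (VLLM, PrometheusEval, ABSOLUTE_PROMPT,
# SCORE_RUBRIC_TEMPLATE, single_absolute_grade) are assumed from the library's README.
import json

from prometheus_eval import PrometheusEval
from prometheus_eval.prompts import ABSOLUTE_PROMPT, SCORE_RUBRIC_TEMPLATE
from prometheus_eval.vllm import VLLM

judge = PrometheusEval(
    model=VLLM(model="prometheus-eval/prometheus-7b-v2.0"),
    absolute_grade_template=ABSOLUTE_PROMPT,
)

with open("./outputs/api_response.json") as f:  # illustrative path
    responses = json.load(f)

instance = responses["planning_travel_plan_0"]
# Assumes the rubric keys (criteria, score1_description, ..., score5_description)
# match the placeholders in SCORE_RUBRIC_TEMPLATE.
rubric = SCORE_RUBRIC_TEMPLATE.format(**instance["score_rubric"])

feedback, score = judge.single_absolute_grade(
    instruction=instance["input"],
    response=instance["response"],
    rubric=rubric,
    reference_answer=instance["reference_answer"],
)
print(score, feedback)
```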



### **Generating Reports**:
- To create a performance report from the evaluated feedback:
```bash
python make_table.py --feedback_file_path "./feedback/evaluated.json"
```
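
Judging from the fields consumed by the report script in this toolkit, each entry in the evaluated feedback file is expected to contain at least a `capability`, a numeric `score`, and the `response_model_name` / `eval_model_name` used. A rough sketch of one entry (values are placeholders):

```json
{
    "planning_travel_plan_0": {
        "capability": "planning",
        "score": 4,
        "response_model_name": "sample_model",
        "eval_model_name": "prometheus-eval/prometheus-7b-v2.0"
    }
}
```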


## 🛠️ Custom Run

The scripts within this toolkit are implemented to be as neat and comprehensible as possible, making it easy for you to modify them as needed. Whether you want to adjust response generation parameters, focus on specific capabilities, or tune the model configurations to better suit your GPU environment, these scripts are built to accommodate such customizations.

### 🎛️ Adjusting Model Parameters

For instance, if you're using the `VLLM` model class and wish to optimize its parameters for your specific GPU setup, you can adjust settings such as `tensor_parallel_size`, `gpu_memory_utilization`, `max_model_len`, or `quantization`. This customization allows you to make the most efficient use of your hardware resources. Here's a snippet to guide you on adjusting the VLLM parameters in any script:

```python
# Example of customizing VLLM parameters
if model_name.endswith("AWQ"):
    model = VLLM(model_name, tensor_parallel_size=4, quantization="AWQ")  # Adjust `tensor_parallel_size` as needed
elif model_name.endswith("GPTQ"):
    model = VLLM(model_name, tensor_parallel_size=4, gpu_memory_utilization=0.9, quantization="GPTQ")  # Adjust for your GPU capacity
else:
    model = VLLM(model_name, tensor_parallel_size=4, max_model_len=8192)  # Default setting
```

### 🎯 Focusing on Specific Capabilities

If you are interested in testing only a particular capability, such as "reasoning" or "multilingual", you can modify the script to filter out other capabilities. This can be done by inserting a simple conditional check to skip unwanted capabilities during the loading of your dataset:

```python
# Example of filtering for specific capabilities
for row in dataset.iterrows():
    record = row[1].to_dict()
    if record["capability"] != "desired_capability":
        continue  # Skip processing this record
    # Your processing logic here
```

### 🔧 Modifying Response Generation Parameters

Adjusting response generation parameters like `temperature`, `top_p`, or `max_tokens` is straightforward. You can tweak these parameters directly in the `params` dictionary used in the completion methods:

```python
# Example of customizing response generation parameters
params = {
"max_tokens": 512,
"temperature": 0.5,
"top_p": 0.85,
"use_tqdm": True,
}
```

By tailoring these scripts to your needs, you can maximize the effectiveness of your evaluations and ensure the toolkit performs optimally within your computational environment. Feel free to dive into the code and make it your own!
Binary file added BiGGen-Bench/assets/logo.png
77 changes: 77 additions & 0 deletions BiGGen-Bench/make_report.py
@@ -0,0 +1,77 @@
import argparse
import json

from rich import box
from rich.console import Console
from rich.table import Table


def read_json(file_path):
    with open(file_path, "r") as file:
        data = json.load(file)
    return data


def main(args):
    console = Console()

    feedback_file_path = args.feedback_file_path
    feedback_data = read_json(feedback_file_path)
    feedback_data_list = list(feedback_data.values())

    scores = {
        "grounding": [],
        "instruction_following": [],
        "planning": [],
        "reasoning": [],
        "refinement": [],
        "safety": [],
        "theory_of_mind": [],
        "tool_usage": [],
        "multilingual": [],
    }

    response_model = feedback_data_list[0]["response_model_name"]
    eval_model = feedback_data_list[0]["eval_model_name"]

    for _, instance in feedback_data.items():
        capability = instance["capability"]
        scores[capability].append(instance["score"])

    # Initialize table for output
    table = Table(
        title=f"Performance Report for {response_model} graded by {eval_model}",
        box=box.ROUNDED,
    )
    table.add_column("Capability", justify="left", style="cyan", no_wrap=True)
    table.add_column("Average Score", justify="right", style="green")

    for capability, score_list in scores.items():
        average_score = sum(score_list) / len(score_list) if score_list else None
        if average_score is not None:
            table.add_row(capability, f"{average_score:.3f}")
        else:
            table.add_row(capability, "N/A")

    all_scores = [
        sum(score_list) / len(score_list)
        for score_list in scores.values()
        if score_list
    ]
    overall_average = sum(all_scores) / len(all_scores) if all_scores else 0
    table.add_row("Overall", f"{overall_average:.3f}", style="bold red")

    console.print(table)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Score the model")
    parser.add_argument(
        "--feedback_file_path",
        type=str,
        required=True,
        help="Path to the feedback file",
    )

    args = parser.parse_args()
    main(args)