BiGGen-Bench


BiGGen-Bench is a dedicated benchmarking platform designed to evaluate the nuanced capabilities of Large Language Models across a variety of complex and diverse tasks.

🚀 Features

  • Evaluation Scope: Covers nine key capabilities of LLMs across 77 tasks, with 765 unique instances tailored to test specific aspects of model performance.
  • Scoring System: Utilizes a detailed scoring rubric from 1 to 5, reflecting a range of outcomes based on instance-specific criteria closely aligned with the nuanced requirements of each task.
  • Transparency and Openness: All code, data, and detailed evaluation results are publicly available to foster transparency and enable community-driven enhancements and verification.

📋 Prerequisites

Before you dive in, make sure you have the following:

  • Python 3.10+: The scripts are tested with Python 3.10 and later versions. You can download Python from python.org (a quick version check follows this list).
  • Pip: Python's package installer. It usually comes with Python; make sure it's updated to the latest version using python -m pip install --upgrade pip.
  • Virtual Environment (optional but recommended)
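
If you want to confirm that the interpreter you plan to use meets that requirement, here is a tiny, optional check (not part of the toolkit itself):

# Optional: confirm the interpreter satisfies the Python 3.10+ requirement
import sys

assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version.split()[0]}"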

🚀 Installation

First, clone the repository and move to the project directory.

git clone https://github.com/prometheus-eval/prometheus-eval.git
cd prometheus-eval
cd BiGGen-Bench

Install the necessary Python packages:

pip install -r requirements.txt

This will install all required libraries, including prometheus-eval, vllm, huggingface_hub, pandas, transformers, and others needed to run the scripts.
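
As an optional sanity check, you can confirm that the core dependencies import cleanly; this snippet is illustrative and not part of the toolkit:

# Optional sanity check: the core dependencies should import without errors
import huggingface_hub
import pandas
import prometheus_eval
import transformers
import vllm

print("vllm", vllm.__version__, "| transformers", transformers.__version__, "| pandas", pandas.__version__)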

📁 Project Structure

The toolkit contains several scripts categorized based on their functionality:

  • Inference Scripts:

    • run_api_inference.py: Runs inference against API-hosted models using AsyncLiteLLM (built on litellm).
    • run_base_inference.py: Runs inference with base (pre-trained, non-chat) models, handling the prompt formatting they expect.
    • run_chat_inference.py: Generates responses with chat models, applying each model's chat template via AutoTokenizer.
  • Evaluation Scripts:

    • run_response_eval.py: Evaluates the responses generated by inference scripts using various evaluation metrics.
    • make_table.py: Generates a summary table from the evaluation results, presenting average scores and insights.

Each script is equipped with command-line interface (CLI) support for easy configuration and execution.

🖥️ Usage

Here's how to run the scripts:

Running Inference:

For API model inference:

python run_api_inference.py --model_name "your-model-name" --output_file_path "./outputs/api_response.json"
  • With the help of litellm, you can use APIs from various providers; we mainly used the OpenAI API and OpenRouter. Refer to openrouter/models for the list of supported models.

  • Note that you need to have your API key ready in a separate .env file before running inference! A minimal sketch of how the key is typically picked up follows below.
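
The sketch assumes the script loads the .env file with python-dotenv and that litellm reads provider credentials from environment variables; names such as OPENAI_API_KEY or OPENROUTER_API_KEY depend on your provider:

# Sketch only: how a key stored in .env usually reaches litellm.
# Assumes python-dotenv is available; adjust the variable names to your provider.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key-value pairs from a local .env file into the environment

assert os.getenv("OPENAI_API_KEY") or os.getenv("OPENROUTER_API_KEY"), \
    "Add your provider key to the .env file before running run_api_inference.py"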

For base model inference:

python run_base_inference.py --model_name "your-model-name" --output_file_path "./outputs/base_response.json"
  • The model must be available on the Hugging Face Hub and supported by vLLM.

For chat model inference:

python run_chat_inference.py --model_name "your-model-name" --output_file_path "./outputs/chat_response.json"
  • The model must be available on the Hugging Face Hub and supported by vLLM.

  • If you already have your own inference script, make sure your response file follows the format of sample_responses.json shown below (a conversion sketch follows the example).

    {
        "planning_travel_plan_0": {
            "id": "planning_travel_plan_0",
            "capability": "planning",
            "task": "travel_plan",
            "instance_idx": 0,
            "system_prompt": "You are a travel agent that can design travel plans.",
            "input": "Design a travel plan for a tourist traveling to the given destination. The tourist has a list of requirements and you should design your plan such that it satisfies all of these requirements.\n\nDestination: Paris\n\nRequirements:\n- Total Duration: 2 days and 1 night\n- Transportation: Walk\n- Must Have: Eiffel Tower, Louvre Museum, Escargot\n- Optional: Croissant, Onion Soup, Notre Dame Cathedral",
            "reference_answer": "Day 1 - Morning:\n- Visit the Louvre Museum (3 hours)\n- Walk to Caf\u00e9 de Flore (15 minutes)\nDay 1 - Lunch:\n- Caf\u00e9 de Flore - Enjoy croissants and French cuisine (1 hour)\nDay 1 - Afternoon:\n- Walk to Notre Dame Cathedral (20 minutes)\n- Explore Notre Dame (1.5 hours)\n- Walk to Eiffel Tower (1 hour)\nDay 1 - Evening:\n- Visit Eiffel Tower (2 hours)\n- Walk to Le Petit Cler (15 minutes)\nDay 1 - Dinner:\n- Le Petit Cler - Try Escargot and French Onion Soup (1.5 hours)\n\nDay 2 - Morning:\n- Leisure time around Eiffel Tower area (1 hour)\n- Walk to Sainte-Chapelle (1 hour)\nDay 2 - Midday:\n- Visit Sainte-Chapelle (1 hour)\n- Walk to nearby caf\u00e9 (20 minutes)\nDay 2 - Lunch:\n- Les Deux Magots - Experience Parisian caf\u00e9 culture (1 hour)\nDay 2 - Afternoon:\n- Stroll along the Seine River, enjoy local street food and shops (2 hours)",
            "score_rubric": {
                "criteria": "Does the response effectively plan a tourist's 2-day trip to Paris, incorporating the must-have experiences within the given constraints (walking as transportation, including Eiffel Tower, Louvre Museum, and Escargot), while also considering the optional items?",
                "score1_description": "The response outlines a basic itinerary but misses more than two must-have experiences or does not consider the walking-only transportation constraint, showing a lack of detailed planning.",
                "score2_description": "The response includes all must-have experiences and adheres to the walking-only transportation constraint, but the plan is unrealistic or poorly structured (e.g., significant backtracking or timing issues), showing some effort in planning but lacking in feasibility or efficiency.",
                "score3_description": "The response successfully incorporates all must-have experiences and the walking-only transportation constraint, with a realistic and somewhat efficient itinerary. However, it misses the opportunity to include any optional items, showing good planning but limited optimization for the tourist's experience.",
                "score4_description": "The response includes all must-have experiences and adheres to the walking-only transportation constraint with a realistic and efficient itinerary. It includes at least one optional item, showing very good planning that enhances the tourist's experience.",
                "score5_description": "The response presents a well-thought-out, efficient, and realistic itinerary that includes all must-have experiences within the walking-only constraint and incorporates all optional items, demonstrating excellent planning and optimization for an enriching tourist experience."
            },
            "response": "Hello World!",
            "response_model_name": "sample_model"
        },
        ...
    }
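
The sketch below shows one way to produce a file in this format from your own generations. It assumes the benchmark instances come from the Hugging Face dataset prometheus-eval/BiGGen-Bench with the field names shown in the sample, and my_generate is a placeholder for your own inference code:

# Sketch: build a response file in the sample_responses.json format.
# Dataset name, split, and field names are assumed to match the sample above;
# `my_generate` is a placeholder for your own inference code.
import json
from datasets import load_dataset

dataset = load_dataset("prometheus-eval/BiGGen-Bench", split="test")  # split name may differ

def my_generate(system_prompt: str, user_input: str) -> str:
    return "Hello World!"  # replace with your model's actual output

records = {}
for instance in dataset:
    entry = {key: instance[key] for key in (
        "id", "capability", "task", "instance_idx",
        "system_prompt", "input", "reference_answer", "score_rubric",
    )}
    entry["response"] = my_generate(instance["system_prompt"], instance["input"])
    entry["response_model_name"] = "my_custom_model"
    records[instance["id"]] = entry

with open("./outputs/custom_response.json", "w") as f:
    json.dump(records, f, indent=4)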

Evaluating Responses:

  • To evaluate responses and generate feedback:

    python run_response_eval.py --model_name "prometheus-eval/prometheus-7b-v2.0" --input_file_path "./outputs/api_response.json" --output_file_path "./feedback/evaluated.json"
  • For evaluation with prometheus-eval/prometheus-8x7b-v2.0 and prometheus-eval/prometheus-bgb-8x7b-v2.0, you will need to increase the tensor_parallel_size passed to vLLM to avoid CUDA OOM errors; a rough sketch follows below.
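
The change amounts to something like the following; the import path is taken from the prometheus-eval package, and the exact loading code in run_response_eval.py may differ:

# Sketch: raise tensor_parallel_size for the 8x7B judges to avoid CUDA OOM.
# Mirrors the VLLM snippet in the Custom Run section below; adjust to your GPUs.
from prometheus_eval.vllm import VLLM

model_name = "prometheus-eval/prometheus-bgb-8x7b-v2.0"
tensor_parallel_size = 4 if "8x7b" in model_name else 1  # the 7B judge fits on a single GPU
model = VLLM(model_name, tensor_parallel_size=tensor_parallel_size)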

Generating Reports:

  • To create a performance report from the evaluated feedback (a do-it-yourself aggregation sketch also follows below):
    python make_table.py --feedback_file_path "./feedback/evaluated.json"
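
If you prefer to aggregate the results yourself, here is a minimal pandas sketch. It assumes the evaluated feedback file keeps the per-instance structure of the response file with an added numeric score entry; the "score" field name is an assumption, so adapt it to whatever your feedback file actually contains:

# Sketch: average scores per capability from an evaluated feedback file.
# Assumes each instance carries a numeric "score" field (field name assumed).
import json

import pandas as pd

with open("./feedback/evaluated.json") as f:
    feedback = json.load(f)

df = pd.DataFrame(list(feedback.values()))
print(df.groupby("capability")["score"].mean().round(2))  # per-capability averages
print("overall:", round(df["score"].mean(), 2))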

🛠️ Custom Run

The scripts within this toolkit are implemented to be as neat and comprehensible as possible, making it easy for you to modify them as needed. Whether you want to adjust response generation parameters, focus on specific capabilities, or tune the model configurations to better suit your GPU environment, these scripts are built to accommodate such customizations.

🎛️ Adjusting Model Parameters

For instance, if you're serving models with vLLM and wish to optimize its parameters for your specific GPU setup, you can adjust settings such as tensor_parallel_size, gpu_memory_utilization, max_model_len, or quantization. This lets you make the most efficient use of your hardware resources. Here's a snippet to guide you on adjusting the vLLM parameters in any script:

# Example of customizing VLLM parameters
if model_name.endswith("AWQ"):
    model = VLLM(model_name, tensor_parallel_size=4, quantization="AWQ")  # Adjust `tensor_parallel_size` as needed
elif model_name.endswith("GPTQ"):
    model = VLLM(model_name, tensor_parallel_size=4, gpu_memory_utilization=0.9, quantization="GPTQ")  # Adjust for your GPU capacity
else:
    model = VLLM(model_name, tensor_parallel_size=4, max_model_len=8192)  # Default setting

🎯 Focusing on Specific Capabilities

If you are interested in testing only a particular capability, such as "reasoning" or "multilingual", you can modify the script to filter out other capabilities. This can be done by inserting a simple conditional check to skip unwanted capabilities during the loading of your dataset:

# Example of filtering for specific capabilities
for _, row in dataset.iterrows():
    record = row.to_dict()
    if record["capability"] != "desired_capability":
        continue  # Skip processing this record
    # Your processing logic here
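
If you are unsure of the exact capability keys to filter on, you can list them straight from the loaded dataset; a one-line sketch, assuming the same pandas DataFrame as above:

# List the capability keys present in the dataset before filtering
print(sorted(dataset["capability"].unique()))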

🔧 Modifying Response Generation Parameters

Adjusting response generation parameters like temperature, top_p, or max_tokens is straightforward. You can tweak these parameters directly in the params dictionary used in the completion methods:

# Example of customizing response generation parameters
params = {
    "max_tokens": 512,
    "temperature": 0.5,
    "top_p": 0.85,
    "use_tqdm": True,
}
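
For context, the dictionary is then unpacked into the generation call. The sketch below reuses the model and params objects from the snippets above and assumes the wrapper exposes a completions-style method that forwards these keyword arguments; the method name is an assumption, so match it to the script you are editing:

# Sketch: forwarding the customized params to the generation call.
# `model.completions` is assumed here; check the actual method in the script.
prompts = ["You are a helpful assistant. Summarize the benchmark in one sentence."]
outputs = model.completions(prompts, **params)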

By tailoring these scripts to your needs, you can maximize the effectiveness of your evaluations and ensure the toolkit performs optimally within your computational environment. Feel free to dive into the code and make it your own!