🛠️ToolBench🤖

Model • Data Release • Web Demo • Tool Eval • Paper • Citation

🔨This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction-tuning (SFT) data to facilitate building powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs by collecting a high-quality instruction-tuning dataset. The dataset is constructed automatically using ChatGPT (gpt-3.5-turbo-16k), which has been upgraded with enhanced function-call capabilities. We provide the dataset, the corresponding training and evaluation scripts, and ToolLLaMA, a capable model fine-tuned on ToolBench.

💁‍♂️💁💁‍♀️ Join Us on Discord!

Read this in 中文.

What's New

[2023/8/8] No more hallucination! ToolLLaMA-2-7b (fine-tuned from LLaMA-2-7b) is released with lower API hallucination than ChatGPT.
[2023/8/4] We provide a RapidAPI backend service so that you do not need to use your own RapidAPI key or subscribe to the APIs yourself. Please fill out our form. We will review it as soon as possible and send you the ToolBench key to get started!
[2023/8/1] Our paper is released.
[2023/7/27] A new version of ToolBench is released.

✨Here is an overview of the dataset construction, training, and evaluation.

✨✨Features:

API Collection: we gather 16464 representational state transfer (REST) APIs from RapidAPI, a platform that hosts a large number of real-world APIs provided by developers.
Instruction Generation: we curate instructions that involve both single-tool and multi-tool scenarios.
Answer Annotation: we develop a novel depth-first search based decision tree (DFSDT) to bolster the planning and reasoning ability of LLMs, which significantly improves the annotation efficiency and successfully annotates those complex instructions that cannot be answered with CoT or ReACT. We provide responses that not only include the final answer but also incorporate the model's reasoning process, tool execution, and tool execution results.
API Retriever: we incorporate API retrieval to equip ToolLLaMA with open-domain tool-using abilities.
All the data is automatically generated with the OpenAI API and filtered by us, so the whole data creation process is easy to scale up.
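
To make the DFSDT idea more concrete, here is a minimal sketch of a depth-first search with backtracking over a decision tree. It is not the ToolBench implementation: the function `dfs_decision_tree`, its `expand` and `is_final` callbacks, and the toy string states are all hypothetical, and a real annotator would use an LLM to propose child actions and to decide when to give up on a branch.

```python
from typing import Callable, List, Optional


def dfs_decision_tree(
    state: str,
    expand: Callable[[str], List[str]],
    is_final: Callable[[str], bool],
    max_depth: int,
    path: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Return the first path of states that ends in a final answer, or None."""
    path = (path or []) + [state]
    if is_final(state):
        return path
    if len(path) > max_depth:
        return None  # give up on this branch and backtrack
    for child in expand(state):
        found = dfs_decision_tree(child, expand, is_final, max_depth, path)
        if found is not None:
            return found
    return None  # every child failed, so the caller tries its next sibling


# Toy problem (hypothetical): grow a string one letter at a time until it is "abb".
def toy_expand(state: str) -> List[str]:
    return [state + "a", state + "b"] if len(state) < 3 else []


toy_path = dfs_decision_tree("", toy_expand, lambda s: s == "abb", max_depth=5)
print(toy_path)  # ['', 'a', 'ab', 'abb']
```

Unlike a single chain of reasoning, the search abandons the dead-end branch under "aa" and continues with "ab", which is the backtracking behaviour that lets a tree search recover from an early wrong step.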

We also provide a demo of using ToolLLaMA:

toolbench-demo.mp4

Currently, our ToolLLaMA has reached the performance of ChatGPT (turbo-16k) in tool use. In the future, we will continue to improve the data quality and increase the coverage of real-world tools.

Here is the old version of ToolBench.

Data

👐ToolBench is intended solely for research and educational purposes and should not be construed as reflecting the opinions or views of the creators, owners, or contributors of this dataset. It is distributed under the CC BY-NC 4.0 License. Below are the statistics of the data:

Tool Nums	API Nums	Instance Nums	Real API Call	Reasoning Traces
3451	16464	12657	37204	4.1

We crawl 16000+ real-world APIs from RapidAPI, and curate realistic human instructions that involve them. Below we present a hierarchy of RapidAPI and our instruction generation process.

ToolBench contains both single-tool and multi-tool scenarios. The multi-tool scenarios can be further categorized into intra-category multi-tool and intra-collection multi-tool. We use the DFSDT method for data creation in all scenarios. Here is an illustration of the data creation process using the DFSDT method:

Data Release

Please download our dataset using one of the following links: Google Drive or Tsinghua Cloud.

G1, G2, and G3 data refer to single-tool, intra-category multi-tool, and intra-collection multi-tool data, respectively. We also have an Atlas Explorer for visualization.
We split each of the G1, G2, and G3 datasets into train, eval, and test parts and combine the train data for training in our main experiments. toolllama_G123_dfs_train.json refers to the combined train data.
The tool-environment data is in the toolenv directory.
We sample 100 instances from every test set. The test_query_ids directory contains query ids of the test instances in each test set.
The data used for tool retrieval is included in the retrieval directory.

🤖Model

We release the ToolLLaMA-7b, ToolLLaMA-7b-LoRA, and ToolLLaMA-2-7b models, all of which are trained on the released dataset in a multi-task fashion. We also release the tool retriever trained under our experimental setting.

🚀Fine-tuning

Install

Clone this repository and navigate to the ToolBench folder.

git clone git@github.com:OpenBMB/ToolBench.git
cd ToolBench

Install Package (python>=3.9)

pip install -r requirements.txt

or for ToolEval only

pip install -r toolbench/tooleval/requirements.txt

Prepare the data and tool environment:

wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Vis-RxBstXLKC1W1agIQUJNuumPJrrw0&confirm=yes' -O data.zip
unzip data.zip

Training Retriever

Data preprocessing:

export PYTHONPATH=./
python preprocess/preprocess_retriever_data.py \
    --query_file data/instruction/G1_query.json \
    --index_file data/test_query_ids/G1_instruction_test_query_ids.json \
    --dataset_name G1 \
    --output_dir data/retrieval/G1

Then run the following command to train the tool retriever:

export PYTHONPATH=./
python toolbench/retrieval/train.py \
    --data_path data/retrieval/G1/ \
    --model_name bert-base-uncased \
    --output_path retrieval_model \
    --num_epochs 5 \
    --train_batch_size 32 \
    --learning_rate 2e-5 \
    --warmup_steps 500 \
    --max_seq_length 256

Training ToolLLaMA

Our training code is based on FastChat. You can use the following command to train ToolLLaMA-7b with 2 x A100 (80GB), with the preprocessed data in our data link:

export PYTHONPATH=./
torchrun --nproc_per_node=2 --master_port=20001 toolbench/train/train_long_seq.py \
    --model_name_or_path huggyllama/llama-7b  \
    --data_path  data/toolllama_G123_dfs_train.json \
    --eval_data_path  data/toolllama_G123_dfs_eval.json \
    --conv_template tool-llama-single-round \
    --bf16 True \
    --output_dir toolllama \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "epoch" \
    --prediction_loss_only \
    --save_strategy "epoch" \
    --save_total_limit 8 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to none
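
As a quick sanity check on the flags above, the effective batch size per optimizer step is the product of the number of processes, the per-device batch size, and the gradient accumulation steps. The helper below is a small illustrative sketch (the function name is hypothetical) and assumes a single node with one process per GPU, as in the `torchrun --nproc_per_node=2` command.

```python
def effective_batch_size(num_gpus: int, per_device_batch_size: int, grad_accum_steps: int) -> int:
    """Number of training examples contributing to one optimizer step."""
    return num_gpus * per_device_batch_size * grad_accum_steps


# Values taken from the command above: 2 GPUs, batch size 2 per device, 8 accumulation steps.
full_finetune_batch = effective_batch_size(2, 2, 8)
print(full_finetune_batch)  # 32
```

If you train on a different number of GPUs, adjusting `--gradient_accumulation_steps` so that this product stays the same is one way to keep the optimization setup comparable.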

You can also preprocess and split the data in your own way with this command:

export PYTHONPATH=./
python preprocess/preprocess_toolllama_data.py \
    --tool_data_dir data/answer/G1_answer \
    --method DFS_woFilter_w2 \
    --output_file data/answer/toolllama_G1_dfs.json

To train the LoRA version:

export PYTHONPATH=./
deepspeed --master_port=20001 toolbench/train/train_long_seq_lora.py \
    --model_name_or_path huggyllama/llama-7b  \
    --data_path  data/toolllama_G123_dfs_train.json \
    --eval_data_path  data/toolllama_G123_dfs_eval.json \
    --conv_template tool-llama-single-round \
    --bf16 True \
    --output_dir toolllama_lora \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "epoch" \
    --prediction_loss_only \
    --save_strategy "epoch" \
    --save_total_limit 8 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.04 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 8192 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed ds_configs/stage2.json \
    --report_to none

Inference With Our RapidAPI Server

Please fill out the form first. After reviewing it, we will send you the ToolBench key. Then set your ToolBench key with:

export TOOLBENCH_KEY="your_toolbench_key"

For ToolLLaMA

To run inference with ToolLLaMA, run the following commands:

export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model toolllama \
    --model_path ToolBench/ToolLLaMA-7b \
    --max_observation_length 1024 \
    --observ_compress_method truncate \
    --method DFS_woFilter_w2 \
    --input_query_file data/instruction/inference_query_demo.json \
    --output_answer_file data/answer/toolllama_dfs \
    --toolbench_key $TOOLBENCH_KEY

For ToolLLaMA-LoRA:

export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model toolllama \
    --model_path huggyllama/llama-7b \
    --lora \
    --lora_path /path/to/your/downloaded/ToolLLaMA-7b-LoRA \
    --max_observation_length 1024 \
    --observ_compress_method truncate \
    --method DFS_woFilter_w2 \
    --input_query_file data/instruction/inference_query_demo.json \
    --output_answer_file data/answer/toolllama_lora_dfs \
    --toolbench_key $TOOLBENCH_KEY

For ToolLLaMA-LoRA under open-domain setting, run:

export PYTHONPATH=./
python toolbench/inference/qa_pipeline_open_domain.py \
    --tool_root_dir data/toolenv/tools/ \
    --corpus_tsv_path data/retrieval/G1/corpus.tsv \
    --retrieval_model_path /path/to/your/retrieval_model \
    --retrieved_api_nums 5 \
    --backbone_model toolllama \
    --model_path huggyllama/llama-7b \
    --lora \
    --lora_path /path/to/your/toolllama_lora \
    --max_observation_length 1024 \
    --observ_compress_method truncate \
    --method DFS_woFilter_w2 \
    --input_query_file data/instruction/inference_query_demo_open_domain.json \
    --output_answer_file data/answer/toolllama_lora_dfs_open_domain \
    --toolbench_key $TOOLBENCH_KEY
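
In the open-domain setting, `--retrieved_api_nums 5` controls how many candidate APIs are retrieved for each query. The sketch below illustrates top-k retrieval by embedding similarity in plain Python. It is an assumption-laden illustration, not the ToolBench retriever: the function names, the toy two-dimensional vectors, and the use of cosine similarity are all hypothetical stand-ins for embeddings produced by a trained model.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_apis(query_vec: Sequence[float], api_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the names of the k APIs whose embeddings are most similar to the query."""
    ranked = sorted(api_vecs, key=lambda name: cosine(query_vec, api_vecs[name]), reverse=True)
    return ranked[:k]


# Toy embeddings (hypothetical API names and vectors).
toy_api_vecs = {
    "weather_forecast": [1.0, 0.0],
    "stock_quote": [0.0, 1.0],
    "air_quality": [0.8, 0.6],
}
retrieved = top_k_apis([1.0, 0.1], toy_api_vecs, k=2)
print(retrieved)  # ['weather_forecast', 'air_quality']
```

Only the retrieved APIs are then shown to the model, so a larger k trades a longer prompt for a lower chance of missing the relevant API.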

For OpenAI Models

To use ChatGPT, run:

export TOOLBENCH_KEY=""
export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model chatgpt_function \
    --openai_key $OPENAI_KEY \
    --max_observation_length 1024 \
    --method DFS_woFilter_w2 \
    --input_query_file data/instruction/inference_query_demo.json \
    --output_answer_file data/answer/chatgpt_dfs \
    --toolbench_key $TOOLBENCH_KEY

To use Text-Davinci-003, run:

export TOOLBENCH_KEY=""
export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model davinci \
    --openai_key $OPENAI_KEY \
    --max_observation_length 1024 \
    --method DFS_woFilter_w2 \
    --input_query_file data/instruction/inference_query_demo.json \
    --output_answer_file data/answer/davinci_dfs \
    --toolbench_key $TOOLBENCH_KEY

Inference With Your Own RapidAPI Account

To run inference with your own RapidAPI account, pass your RapidAPI key through the rapidapi_key argument and add the use_rapidapi_key flag in the script:

export RAPIDAPI_KEY=""
export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
    --tool_root_dir data/toolenv/tools/ \
    --backbone_model chatgpt_function \
    --openai_key $OPENAI_KEY \
    --max_observation_length 1024 \
    --method DFS_woFilter_w2 \
    --input_query_file data/instruction/inference_query_demo.json \
    --output_answer_file data/answer/chatgpt_dfs \
    --rapidapi_key $RAPIDAPI_KEY \
    --use_rapidapi_key

Setting up and running the interface

ToolBench contains a Web UI based on Chatbot UI, forked to include the use of tools in the interface. It comes in two parts: the backend server, and chatbot-ui-toolllama. Here is a video demo.

Web UI

git clone https://github.com/lilbillybiscuit/chatbot-ui-toolllama
cd chatbot-ui-toolllama
npm install
npm run dev

The app will be available at http://localhost:3000/

Backend server

export PYTHONPATH=./
python toolbench/inference/toolbench_server.py \
    --tool_root_dir data/toolenv/tools/ \
    --corpus_tsv_path data/retrieval/G1/corpus.tsv \
    --retrieval_model_path /path/to/your/retrieval_model \
    --retrieved_api_nums 5 \
    --backbone_model toolllama \
    --model_path huggyllama/llama-7b \
    --lora \
    --lora_path /path/to/your/toolllama_lora \
    --max_observation_length 1024 \
    --method DFS_woFilter_w2 \
    --input_query_file data/instruction/inference_query_demo_open_domain.json \
    --output_answer_file data/answer/toolllama_lora_dfs_open_domain \
    --rapidapi_key $RAPIDAPI_KEY

This server will be available at http://localhost:5000/. To start a request, call http://localhost:5000/stream with a GET or POST request containing a JSON object with the following fields:

{
    "text": "What is the weather in New York today?",
    "top_k": 5,
    "method": "DFS_woFilter_w2"
}
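
As an illustration, the request can be built with Python's standard library. This is a minimal sketch: the helper name `build_stream_request` is hypothetical, the default base URL assumes a plain-HTTP server on your local machine (adjust it to match your setup), and the response format is not specified here, so the sending step is left commented out and simply prints whatever lines the server returns.

```python
import json
import urllib.request


def build_stream_request(
    text: str,
    top_k: int = 5,
    method: str = "DFS_woFilter_w2",
    base_url: str = "http://localhost:5000",  # assumption: local backend server
) -> urllib.request.Request:
    """Build a POST request for the /stream endpoint with the JSON fields shown above."""
    payload = {"text": text, "top_k": top_k, "method": method}
    return urllib.request.Request(
        url=f"{base_url}/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


demo_request = build_stream_request("What is the weather in New York today?")

# To actually send it (requires the backend server to be running):
# with urllib.request.urlopen(demo_request) as response:
#     for line in response:
#         print(line.decode("utf-8"), end="")
```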

ToolEval

By fine-tuning LLaMA on ToolBench, we obtain ToolLLaMA. Considering that human evaluation can be time-consuming, we follow AlpacaEval to develop an efficient machine evaluator ToolEval, which incorporates two evaluation metrics:

Pass Rate: The proportion of instructions completed successfully within a limited number of OpenAI API calls.
Preference: Measured by comparing two answers (action sequences) for a given instruction. We pre-define a set of criteria for a better answer, which are organized as prompts for ChatGPT. We provide the test instruction and two candidate answers to the evaluator and obtain its preference. We evaluate each answer pair multiple times to improve the reliability of our system. Then we calculate the Win Rate (percentage of being preferred by the evaluator) and Standard Error (the standard error of the Win Rate). More details can be found in our paper.
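
The two metrics can be sketched as follows. This is an illustration rather than the ToolEval code: the function names are hypothetical, each evaluation is reduced to a boolean outcome, and the standard error uses the usual binomial formula sqrt(p(1-p)/n), which is an assumption about how the reported value is computed.

```python
import math
from typing import Sequence, Tuple


def pass_rate(passed: Sequence[bool]) -> float:
    """Fraction of instructions that were completed successfully."""
    return sum(passed) / len(passed)


def win_rate_with_se(wins: Sequence[bool]) -> Tuple[float, float]:
    """Win rate and its binomial standard error over n pairwise comparisons."""
    n = len(wins)
    p = sum(wins) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se


# Toy outcomes (hypothetical): 3 of 4 instructions passed, 6 of 10 comparisons won.
demo_pass = pass_rate([True, True, False, True])
demo_win, demo_se = win_rate_with_se([True] * 6 + [False] * 4)
print(demo_pass)                                   # 0.75
print(round(demo_win, 2), round(demo_se, 3))       # 0.6 0.155
```

The standard error shrinks with the square root of the number of comparisons, which is one reason repeated evaluation of each answer pair makes the reported win rate more stable.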

To validate the effectiveness of the Preference metric, we sample among three different methods (ChatGPT+ReACT, GPT4+ReACT, and ChatGPT+DFSDT) to obtain answer pairs for 600 test instructions. Then we engage humans to annotate their preference for each pair (4 annotations for each answer pair, 2400 annotations in total). Our automatic evaluator, developed using ChatGPT, demonstrates a significant correlation of 75.8% with human annotators. The agreement among different human annotators is 83.54%, and the agreement between humans and our evaluator is 80.21%.

More details about ToolEval can be found in our paper.

Evaluation with ToolEval

To evaluate a model on the G1-Inst. test set, for example, run the following commands.

Pass rate:

python toolbench/tooleval/pass_rate.py --answer_dir data/answer/toolllama_dfs/G1_instruction

Win rate (Reference model: ChatGPT-ReACT):

export OPENAI_KEY=""
export REF_MODEL_DATA="data/answer/chatgpt_cot/G1_instruction"
export REF_MODEL_METHOD="CoT"
export TEST_MODEL_DATA="data/answer/toolllama_dfs/G1_instruction"
export TEST_MODEL_METHOD="DFS"
python ./toolbench/tooleval/convert_to_answer_format.py \
    --method CoT \
    --answer_dir $REF_MODEL_DATA \
    --output ${REF_MODEL_DATA}_converted

python ./toolbench/tooleval/convert_to_answer_format.py \
    --method DFS \
    --answer_dir $TEST_MODEL_DATA \
    --output ${TEST_MODEL_DATA}_converted

python ./toolbench/tooleval/automatic_eval_sample.py \
    --output ${REF_MODEL_DATA}_converted \
    --ref_output ${TEST_MODEL_DATA}_converted \
    --method $REF_MODEL_METHOD \
    --use_existed_output

Please refer to ToolEval for more details.

📊 Model Experiments Results

In our main experiments, ToolLLaMA demonstrates a compelling capability to handle both single-tool and complex multi-tool instructions. We introduce the hallucinate rate (lower is better) as an evaluation metric that complements ToolEval. An instance is counted as hallucinated if its whole decision tree contains at least one hallucinated function call. Below are the main results compared with ChatGPT and Text-Davinci-003.

Hallucinate rate:

model	I1-Inst.	I1-Tool.	I1-Cat.	I2-Inst.	I2-Cat.	I3-Inst.	Average
ChatGPT-DFSDT	2	6	5	14	16	17	10
Text-Davinci-003-DFSDT	6	5	5	6	8	6	6.0
ToolLLaMA	16	13	20	24	24	27	20.7
ToolLLaMA-LoRA	61	62	52	69	67	64	62.5
ToolLLaMA-API Retriever	16	25	22	20	23	37	23.8
ToolLLaMA-2	3	11	9	8	10	10	8.5
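
The hallucinate rate definition above can be sketched as a recursive check over a decision tree. This is only an illustration under stated assumptions: the dictionary layout (`call`, `children`, `tree`, `available_apis`) is hypothetical, and a call is treated as hallucinated when its function name is not among the APIs available for that instance, which may differ from the exact criterion used to produce the table.

```python
from typing import Dict, List, Set


def has_hallucinated_call(node: Dict, available: Set[str]) -> bool:
    """True if this node or any descendant calls a function that is not available."""
    call = node.get("call")
    if call is not None and call not in available:
        return True
    return any(has_hallucinated_call(child, available) for child in node.get("children", []))


def hallucinate_rate(instances: List[Dict]) -> float:
    """Percentage of instances whose tree contains at least one hallucinated call."""
    flagged = sum(
        has_hallucinated_call(inst["tree"], set(inst["available_apis"])) for inst in instances
    )
    return 100.0 * flagged / len(instances)


# Toy instances (hypothetical API names): the second tree calls an unknown function.
toy_instances = [
    {"available_apis": ["get_weather"],
     "tree": {"call": None, "children": [{"call": "get_weather", "children": []}]}},
    {"available_apis": ["get_weather"],
     "tree": {"call": None, "children": [
         {"call": "get_weather", "children": [{"call": "get_forecast_v9", "children": []}]}]}},
]
demo_rate = hallucinate_rate(toy_instances)
print(demo_rate)  # 50.0
```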

Pass Rate:

model	I1-Inst.	I1-Tool.	I1-Cat.	I2-Inst.	I2-Cat.	I3-Inst.	Average
ChatGPT-DFSDT	78	84	89	51	58	57	69.6
ChatGPT-ReACT	56	62	66	28	22	30	44.0
Text-Davinci-003-DFSDT	53	58	61	38	38	39	47.8
Text-Davinci-003-ReACT	19	25	30	12	11	14	18.5
ToolLLaMA	68	80	75	47	56	40	61.0
ToolLLaMA-LoRA	51	63	61	38	42	45	50.0
ToolLLaMA-API Retriever	62	62	72	45	55	47	57.2
ToolLLaMA-2	64	72	78	50	51	46	59.8

Win Rate: (Reference model: ChatGPT-DFSDT)

model	I1-Inst.	I1-Tool.	I1-Cat.	I2-Inst.	I2-Cat.	I3-Inst.	Average
ChatGPT-DFSDT	50	50	50	50	50	50	50.0
ChatGPT-ReACT	38	32	41	43	22	23	30.7
Text-Davinci-003-ReACT	14	21	18	8	7	12	13.3
Text-Davinci-003-DFSDT	38	34	43	25	20	28	31.3
ToolLLaMA	50	45	45	59	48	46	48.8
ToolLLaMA-LoRA	43	36.4	30	42	45	51	41.2
ToolLLaMA-API Retriever	51	39	44	49	49	55	47.8
ToolLLaMA-2	43	42	46	55	46	50	47.0

TODO

ToolLLaMA will reach GPT-4's tool-use capability.

Resources of Tool Learning

With the powerful capabilities of foundation models, we are eager to see their applications in manipulating various tools. For more resources, please refer to the following:

BMTools. [Project]
Tool Learning Survey. [Paper]
Tool Learning Paper List. [Project]
WebCPM. [Paper]

Citation

Feel free to cite us if you like ToolBench.

@misc{qin2023toolllm,
      title={ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs}, 
      author={Yujia Qin and Shihao Liang and Yining Ye and Kunlun Zhu and Lan Yan and Yaxi Lu and Yankai Lin and Xin Cong and Xiangru Tang and Bill Qian and Sihan Zhao and Runchu Tian and Ruobing Xie and Jie Zhou and Mark Gerstein and Dahai Li and Zhiyuan Liu and Maosong Sun},
      year={2023},
      eprint={2307.16789},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

@misc{qin2023tool,
      title={Tool Learning with Foundation Models}, 
      author={Yujia Qin and Shengding Hu and Yankai Lin and Weize Chen and Ning Ding and Ganqu Cui and Zheni Zeng and Yufei Huang and Chaojun Xiao and Chi Han and Yi Ren Fung and Yusheng Su and Huadong Wang and Cheng Qian and Runchu Tian and Kunlun Zhu and Shihao Liang and Xingyu Shen and Bokai Xu and Zhen Zhang and Yining Ye and Bowen Li and Ziwei Tang and Jing Yi and Yuzhang Zhu and Zhenning Dai and Lan Yan and Xin Cong and Yaxi Lu and Weilin Zhao and Yuxiang Huang and Junxi Yan and Xu Han and Xian Sun and Dahai Li and Jason Phang and Cheng Yang and Tongshuang Wu and Heng Ji and Zhiyuan Liu and Maosong Sun},
      year={2023},
      eprint={2304.08354},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
