Support batched eval (#1110)
Jintao-Huang committed Jun 11, 2024
1 parent ddfc9fc commit ba98cac
Showing 18 changed files with 545 additions and 223 deletions.
12 changes: 11 additions & 1 deletion README.md
@@ -440,8 +440,18 @@ CUDA_VISIBLE_DEVICES=0 swift infer \

### Evaluation

Original model:
```shell
# We recommend using vLLM for acceleration (arc evaluated in half a minute)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen1half-7b-chat \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm
```

LoRA fine-tuned:
```shell
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir xxx/checkpoint-xxx \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
    --merge_lora true
```

### Quantization
12 changes: 11 additions & 1 deletion README_CN.md
@@ -436,8 +436,18 @@ CUDA_VISIBLE_DEVICES=0 swift infer \

### Evaluation

Original model:
```shell
# We recommend using vLLM for acceleration (arc evaluated in half a minute):
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen1half-7b-chat \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm
```

After LoRA fine-tuning:
```shell
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir xxx/checkpoint-xxx \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
    --merge_lora true
```

### Quantization
36 changes: 20 additions & 16 deletions docs/source/LLM/LLM评测文档.md
@@ -59,29 +59,33 @@ pip install -e '.[eval]'

## Evaluation

Evaluation supports vLLM for acceleration. Here we demonstrate evaluating both the original model and the LoRA fine-tuned qwen2-7b-instruct.

```shell
# Original model (takes about half an hour on a single A100)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm

# After LoRA fine-tuning
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
    --merge_lora true
```

The list of evaluation parameters can be found [here](./命令行参数.md#eval参数).

### Evaluation using the deployed method

```shell
# Start the deployment using the OpenAI API format
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen2-7b-instruct

# Evaluate using the API
# For a non-swift deployment, additionally pass `--eval_is_chat_model true --model_type qwen2-7b-instruct`
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k

# The same applies to LoRA fine-tuned models (see the sketch below)
```
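For the LoRA case, a minimal sketch of the same deployed flow (the checkpoint path is a placeholder, and passing `--merge_lora true` to `swift deploy` is an assumption based on the flags shown above):

```shell
# Assumed flow: deploy the merged LoRA checkpoint, then evaluate via the endpoint.
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
    --merge_lora true

# In another shell:
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k
```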

## Custom Evaluation Set
2 changes: 1 addition & 1 deletion docs/source/LLM/NPU推理与微调最佳实践.md
@@ -121,7 +121,7 @@ Legend:

### Single-GPU Training

Start single-GPU fine-tuning with the following command. (Note: if NaN values appear during fine-tuning, set `--dtype fp32`; see the sketch after the code block below.)

```shell
# Experimental environment: Ascend 910B3
# ... (remainder collapsed in the diff view)
```
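A minimal sketch of the `--dtype fp32` fallback mentioned above (the NPU visibility variable, model, and dataset are illustrative assumptions, not part of this commit):

```shell
# Illustrative only: single-NPU fine-tuning forced to fp32 to avoid NaN loss.
# ASCEND_RT_VISIBLE_DEVICES, model_type, and dataset are assumed values.
ASCEND_RT_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen1half-7b-chat \
    --dataset blossom-math-zh \
    --dtype fp32
```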
98 changes: 94 additions & 4 deletions docs/source/LLM/VLLM推理加速与部署.md
@@ -267,7 +267,7 @@ curl http://localhost:8000/v1/chat/completions \
}'
```

Using swift's synchronous client interface:
```python
from swift.llm import get_model_list_client, XRequestConfig, inference_client

@@ -301,7 +301,45 @@ response: 杭州有许多美食,例如西湖醋鱼、东坡肉、龙井虾仁
"""
```

Using swift's asynchronous client interface:
```python
import asyncio
from swift.llm import get_model_list_client, XRequestConfig, inference_client_async

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')

query = '浙江的省会在哪里?'
request_config = XRequestConfig(seed=42)
resp = asyncio.run(inference_client_async(model_type, query, request_config=request_config))
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

async def _stream():
    global query
    history = [(query, response)]
    query = '这有什么好吃的?'
    request_config = XRequestConfig(stream=True, seed=42)
    stream_resp = await inference_client_async(model_type, query, history, request_config=request_config)
    print(f'query: {query}')
    print('response: ', end='')
    async for chunk in stream_resp:
        print(chunk.choices[0].delta.content, end='', flush=True)
    print()

asyncio.run(_stream())
"""Out[0]
model_type: qwen-7b-chat
query: 浙江的省会在哪里?
response: 浙江省的省会是杭州市。
query: 这有什么好吃的?
response: 杭州有许多美食,例如西湖醋鱼、东坡肉、龙井虾仁、叫化童子鸡等。此外,杭州还有许多特色小吃,如西湖藕粉、杭州小笼包、杭州油条等。
"""
```

Using openai (synchronous):
```python
from openai import OpenAI
client = OpenAI(
@@ -373,7 +411,7 @@ curl http://localhost:8000/v1/completions \
}'
```

Using swift's synchronous client interface:
```python
from swift.llm import get_model_list_client, XRequestConfig, inference_client

@@ -420,7 +458,59 @@ response: 成都
"""
```

Using swift's asynchronous client interface:
```python
import asyncio
from swift.llm import get_model_list_client, XRequestConfig, inference_client_async

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')

query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
request_config = XRequestConfig(max_tokens=32, temperature=0.1, seed=42)

resp = asyncio.run(inference_client_async(model_type, query, request_config=request_config))
response = resp.choices[0].text
print(f'query: {query}')
print(f'response: {response}')

async def _stream():
    request_config.stream = True
    stream_resp = await inference_client_async(model_type, query, request_config=request_config)
    print(f'query: {query}')
    print('response: ', end='')
    async for chunk in stream_resp:
        print(chunk.choices[0].text, end='', flush=True)
    print()

asyncio.run(_stream())
"""Out[0]
model_type: qwen-7b
query: 浙江 -> 杭州
安徽 -> 合肥
四川 ->
response: 成都
广东 -> 广州
江苏 -> 南京
浙江 -> 杭州
安徽 -> 合肥
四川 -> 成都
query: 浙江 -> 杭州
安徽 -> 合肥
四川 ->
response: 成都
广东 -> 广州
江苏 -> 南京
浙江 -> 杭州
安徽 -> 合肥
四川 -> 成都
"""
```


Using openai (synchronous):
```python
from openai import OpenAI
client = OpenAI(
    # ... (remainder collapsed in the diff view)
```
26 changes: 9 additions & 17 deletions docs/source/LLM/命令行参数.md
@@ -307,29 +307,21 @@ export parameters inherit from infer parameters, with the following additions:

## eval parameters

The eval parameters inherit from the infer parameters, with the following additions. (Note: the generation_config parameters from infer no longer take effect; generation is controlled by [evalscope](https://github.com/modelscope/eval-scope).)

- `--eval_dataset`: The official datasets to evaluate. Defaults to `['ceval', 'gsm8k', 'arc']`; allowed values are 'arc', 'gsm8k', 'mmlu', 'cmmlu', 'ceval', 'bbh', 'general_qa'. If you only need to evaluate a custom dataset, set this parameter to `no`.
- `--eval_few_shot`: The number of few-shot examples for each sub-dataset of each evaluation set. Defaults to `None`, which uses the dataset's default configuration.
- `--eval_limit`: The number of samples drawn from each sub-dataset of each evaluation set. Defaults to `None`, meaning full evaluation.
- `--name`: Distinguishes the result storage paths of runs with the same configuration. Defaults to the current timestamp.
- `--eval_url`: An OpenAI-compatible model endpoint, e.g. `http://127.0.0.1:8000/v1`. Set this when evaluating a deployed model; it is usually not needed. Defaults to `None`.
```shell
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_is_chat_model true --model_type gpt4 --eval_token xxx
```
- `--eval_token`: The token for the OpenAI-compatible endpoint. Defaults to `'EMPTY'`, meaning no token is required.
- `--eval_is_chat_model`: Required when `eval_url` is set, to indicate whether the endpoint serves a `chat` model; `False` means a `base` model. Defaults to `None`.
- `--custom_eval_config`: Evaluate with a custom dataset; must be the path to an existing local file. See [自定义评测集](./LLM评测文档.md#自定义评测集) for the file format. Defaults to `None`.
- `--eval_use_cache`: Whether to reuse an evaluation cache that has already been generated, so finished evaluations are not rerun and only the report is regenerated. Defaults to `False`.

A combined invocation is sketched below.
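As a quick illustration, a sketch combining several of these flags (the flag values and the run name are assumptions chosen for demonstration):

```shell
# Illustrative values: 5-shot prompts, 100 samples per sub-dataset,
# results stored under a custom run name.
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset gsm8k arc \
    --eval_few_shot 5 \
    --eval_limit 100 \
    --name my-eval-run
```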

## app-ui Parameters

28 changes: 11 additions & 17 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -6,6 +6,7 @@
- [dpo Parameters](#dpo-parameters)
- [merge-lora infer Parameters](#merge-lora-infer-parameters)
- [export Parameters](#export-parameters)
- [eval Parameters](#eval-parameters)
- [app-ui Parameters](#app-ui-parameters)
- [deploy Parameters](#deploy-parameters)

@@ -307,28 +308,21 @@ export parameters inherit from infer parameters, with the following added parameters:

## eval parameters

The eval parameters inherit from the infer parameters, with the following additions. (Note: the generation_config parameters from infer no longer take effect; generation is controlled by [evalscope](https://github.com/modelscope/eval-scope).)

- `--eval_dataset`: The official datasets to evaluate. Defaults to `['ceval', 'gsm8k', 'arc']`; possible values are 'arc', 'gsm8k', 'mmlu', 'cmmlu', 'ceval', 'bbh', 'general_qa'. If only custom datasets need to be evaluated, set this parameter to `no`.
- `--eval_few_shot`: The number of few-shot examples for each sub-dataset of each evaluation set. Defaults to `None`, which uses the dataset's default configuration.
- `--eval_limit`: The number of samples drawn from each sub-dataset of each evaluation set. Defaults to `None`, meaning full evaluation.
- `--name`: Distinguishes the result storage paths of runs with the same configuration. Defaults to the current timestamp.
- `--eval_url`: An OpenAI-compatible model endpoint, e.g. `http://127.0.0.1:8000/v1`. Set this when evaluating a deployed model; it is usually not needed. Defaults to `None`.
```shell
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_is_chat_model true --model_type gpt4 --eval_token xxx
```
- `--eval_token`: The token for the OpenAI-compatible endpoint. Defaults to `'EMPTY'`, meaning no token is required.
- `--eval_is_chat_model`: Required when `eval_url` is set, to indicate whether the endpoint serves a `chat` model; `False` means a `base` model. Defaults to `None`.
- `--custom_eval_config`: Evaluate with a custom dataset; must be the path to an existing local file. For the file format, see [Custom Evaluation Set](./LLM-eval.md#Custom-Evaluation-Set). Defaults to `None`.
- `--eval_use_cache`: Whether to reuse an evaluation cache that has already been generated, so finished evaluations are not rerun and only the report is regenerated. Defaults to `False`.

A custom-dataset-only invocation is sketched below.
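For the custom-dataset-only case described under `--eval_dataset` and `--custom_eval_config`, a minimal sketch (the config path is a hypothetical placeholder):

```shell
# Hypothetical config path; see the Custom Evaluation Set docs for the format.
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset no \
    --custom_eval_config custom_eval_config.json
```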

## app-ui Parameters

35 changes: 19 additions & 16 deletions docs/source_en/LLM/LLM-eval.md
@@ -59,29 +59,32 @@ pip install -e '.[eval]'

## Evaluation

Evaluation supports vLLM for acceleration. Here we demonstrate evaluating both the original model and the LoRA fine-tuned qwen2-7b-instruct.

```shell
# Original model (approximately half an hour on a single A100)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm

# After LoRA fine-tuning
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
    --merge_lora true
```

You can refer to [here](./Command-line-parameters.md#eval-parameters) for the list of evaluation parameters.

### Evaluation using the deployed method

```shell
# Start deployment using the OpenAI API method
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen2-7b-instruct

# Evaluate using the API
# If it is not a Swift deployment, you need to additionally pass in `--eval_is_chat_model true --model_type qwen2-7b-instruct`.
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k

# The same applies to the model after LoRA fine-tuning (see the sketch below).
```
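For a LoRA fine-tuned model, a minimal sketch of the same flow (the checkpoint path is a placeholder, and passing `--merge_lora true` to `swift deploy` is an assumption based on the flags shown earlier):

```shell
# Assumed flow: deploy the merged LoRA checkpoint, then evaluate via the endpoint.
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
    --merge_lora true

# In another shell:
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k
```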

## Custom Evaluation Set