Support batched eval (#1110)
Jintao-Huang committed Jun 11, 2024
1 parent ddfc9fc commit ba98cac
Showing 18 changed files with 545 additions and 223 deletions.
12 changes: 11 additions & 1 deletion README.md
@@ -440,8 +440,18 @@ CUDA_VISIBLE_DEVICES=0 swift infer \

### Evaluation

Original model:
```shell
# We recommend using vLLM for acceleration (arc evaluated in half a minute)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen1half-7b-chat \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm
```

LoRA fine-tuned:
```shell
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir xxx/checkpoint-xxx \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
    --merge_lora true
```

### Quantization
12 changes: 11 additions & 1 deletion README_CN.md
@@ -436,8 +436,18 @@ CUDA_VISIBLE_DEVICES=0 swift infer \

### Evaluation

Original model:
```shell
# We recommend using vLLM for acceleration (arc evaluated in half a minute):
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen1half-7b-chat \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm
```

After LoRA fine-tuning:
```shell
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir xxx/checkpoint-xxx \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
    --merge_lora true
```

### Quantization
36 changes: 20 additions & 16 deletions docs/source/LLM/LLM评测文档.md
@@ -59,29 +59,33 @@ pip install -e '.[eval]'

## Evaluation

Evaluation supports vLLM for acceleration. Here we demonstrate evaluating both the original model and the LoRA fine-tuned qwen2-7b-instruct.

```shell
# Original model (takes about half an hour on a single A100)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm

# After LoRA fine-tuning
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
    --merge_lora true
```

The list of evaluation parameters can be found [here](./命令行参数.md#eval参数).

### Evaluation using the deployed method

```shell
# Start the deployment using the OpenAI API format
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen2-7b-instruct

# Evaluate using the API
# For a non-swift deployment, additionally pass `--eval_is_chat_model true --model_type qwen2-7b-instruct`
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k

# The same applies to LoRA fine-tuned models (see the sketch below)
```
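For the LoRA case, a minimal sketch of the same deployed flow (the checkpoint path is a placeholder, and passing `--merge_lora true` to `swift deploy` is an assumption based on the flags shown above):

```shell
# Assumed flow: deploy the merged LoRA checkpoint, then evaluate via the endpoint.
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
    --merge_lora true

# In another shell:
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k
```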

## Custom Evaluation Set
2 changes: 1 addition & 1 deletion docs/source/LLM/NPU推理与微调最佳实践.md
@@ -121,7 +121,7 @@ Legend:

### Single-GPU Training

Start single-GPU fine-tuning with the following command. (Note: if NaN values appear during fine-tuning, set `--dtype fp32`; see the sketch after the code block below.)

```shell
# Experimental environment: Ascend 910B3
# ... (remainder collapsed in the diff view)
```
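A minimal sketch of the `--dtype fp32` fallback mentioned above (the NPU visibility variable, model, and dataset are illustrative assumptions, not part of this commit):

```shell
# Illustrative only: single-NPU fine-tuning forced to fp32 to avoid NaN loss.
# ASCEND_RT_VISIBLE_DEVICES, model_type, and dataset are assumed values.
ASCEND_RT_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen1half-7b-chat \
    --dataset blossom-math-zh \
    --dtype fp32
```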
98 changes: 94 additions & 4 deletions docs/source/LLM/VLLM推理加速与部署.md
@@ -267,7 +267,7 @@ curl http://localhost:8000/v1/chat/completions \
}'
```

Using swift's synchronous client interface:
```python
from swift.llm import get_model_list_client, XRequestConfig, inference_client

@@ -301,7 +301,45 @@ response: 杭州有许多美食,例如西湖醋鱼、东坡肉、龙井虾仁
"""
```

Using swift's asynchronous client interface:
```python
import asyncio
from swift.llm import get_model_list_client, XRequestConfig, inference_client_async

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')

query = '浙江的省会在哪里?'
request_config = XRequestConfig(seed=42)
resp = asyncio.run(inference_client_async(model_type, query, request_config=request_config))
response = resp.choices[0].message.content
print(f'query: {query}')
print(f'response: {response}')

async def _stream():
    global query
    history = [(query, response)]
    query = '这有什么好吃的?'
    request_config = XRequestConfig(stream=True, seed=42)
    stream_resp = await inference_client_async(model_type, query, history, request_config=request_config)
    print(f'query: {query}')
    print('response: ', end='')
    async for chunk in stream_resp:
        print(chunk.choices[0].delta.content, end='', flush=True)
    print()

asyncio.run(_stream())
"""Out[0]
model_type: qwen-7b-chat
query: 浙江的省会在哪里?
response: 浙江省的省会是杭州市。
query: 这有什么好吃的?
response: 杭州有许多美食,例如西湖醋鱼、东坡肉、龙井虾仁、叫化童子鸡等。此外,杭州还有许多特色小吃,如西湖藕粉、杭州小笼包、杭州油条等。
"""
```

Using openai (synchronous):
```python
from openai import OpenAI
client = OpenAI(
@@ -373,7 +411,7 @@ curl http://localhost:8000/v1/completions \
}'
```

Using swift's synchronous client interface:
```python
from swift.llm import get_model_list_client, XRequestConfig, inference_client

@@ -420,7 +458,59 @@ response: 成都
"""
```

Using swift's asynchronous client interface:
```python
import asyncio
from swift.llm import get_model_list_client, XRequestConfig, inference_client_async

model_list = get_model_list_client()
model_type = model_list.data[0].id
print(f'model_type: {model_type}')

query = '浙江 -> 杭州\n安徽 -> 合肥\n四川 ->'
request_config = XRequestConfig(max_tokens=32, temperature=0.1, seed=42)

resp = asyncio.run(inference_client_async(model_type, query, request_config=request_config))
response = resp.choices[0].text
print(f'query: {query}')
print(f'response: {response}')

async def _stream():
    request_config.stream = True
    stream_resp = await inference_client_async(model_type, query, request_config=request_config)
    print(f'query: {query}')
    print('response: ', end='')
    async for chunk in stream_resp:
        print(chunk.choices[0].text, end='', flush=True)
    print()

asyncio.run(_stream())
"""Out[0]
model_type: qwen-7b
query: 浙江 -> 杭州
安徽 -> 合肥
四川 ->
response: 成都
广东 -> 广州
江苏 -> 南京
浙江 -> 杭州
安徽 -> 合肥
四川 -> 成都
query: 浙江 -> 杭州
安徽 -> 合肥
四川 ->
response: 成都
广东 -> 广州
江苏 -> 南京
浙江 -> 杭州
安徽 -> 合肥
四川 -> 成都
"""
```


Using openai (synchronous):
```python
from openai import OpenAI
client = OpenAI(
    # ... (remainder collapsed in the diff view)
```
26 changes: 9 additions & 17 deletions docs/source/LLM/命令行参数.md
@@ -307,29 +307,21 @@ export parameters inherit from infer parameters, with the following additions:

## eval parameters

The eval parameters inherit from the infer parameters, with the following additions. (Note: the generation_config parameters from infer no longer take effect; generation is controlled by [evalscope](https://github.com/modelscope/eval-scope).)

- `--eval_dataset`: The official datasets to evaluate. Defaults to `['ceval', 'gsm8k', 'arc']`; allowed values are 'arc', 'gsm8k', 'mmlu', 'cmmlu', 'ceval', 'bbh', 'general_qa'. If you only need to evaluate a custom dataset, set this parameter to `no`.
- `--eval_few_shot`: The number of few-shot examples for each sub-dataset of each evaluation set. Defaults to `None`, which uses the dataset's default configuration.
- `--eval_limit`: The number of samples drawn from each sub-dataset of each evaluation set. Defaults to `None`, meaning full evaluation.
- `--name`: Distinguishes the result storage paths of runs with the same configuration. Defaults to the current timestamp.
- `--eval_url`: An OpenAI-compatible model endpoint, e.g. `http://127.0.0.1:8000/v1`. Set this when evaluating a deployed model; it is usually not needed. Defaults to `None`.
```shell
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_is_chat_model true --model_type gpt4 --eval_token xxx
```
- `--eval_token`: The token for the OpenAI-compatible endpoint. Defaults to `'EMPTY'`, meaning no token is required.
- `--eval_is_chat_model`: Required when `eval_url` is set, to indicate whether the endpoint serves a `chat` model; `False` means a `base` model. Defaults to `None`.
- `--custom_eval_config`: Evaluate with a custom dataset; must be the path to an existing local file. See [自定义评测集](./LLM评测文档.md#自定义评测集) for the file format. Defaults to `None`.
- `--eval_use_cache`: Whether to reuse an evaluation cache that has already been generated, so finished evaluations are not rerun and only the report is regenerated. Defaults to `False`.

A combined invocation is sketched below.
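As a quick illustration, a sketch combining several of these flags (the flag values and the run name are assumptions chosen for demonstration):

```shell
# Illustrative values: 5-shot prompts, 100 samples per sub-dataset,
# results stored under a custom run name.
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset gsm8k arc \
    --eval_few_shot 5 \
    --eval_limit 100 \
    --name my-eval-run
```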

## app-ui Parameters

28 changes: 11 additions & 17 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -6,6 +6,7 @@
- [dpo Parameters](#dpo-parameters)
- [merge-lora infer Parameters](#merge-lora-infer-parameters)
- [export Parameters](#export-parameters)
- [eval Parameters](#eval-parameters)
- [app-ui Parameters](#app-ui-parameters)
- [deploy Parameters](#deploy-parameters)

@@ -307,28 +308,21 @@ export parameters inherit from infer parameters, with the following added parameters:

## eval parameters

The eval parameters inherit from the infer parameters, with the following additions. (Note: the generation_config parameters from infer no longer take effect; generation is controlled by [evalscope](https://github.com/modelscope/eval-scope).)

- `--eval_dataset`: The official datasets to evaluate. Defaults to `['ceval', 'gsm8k', 'arc']`; possible values are 'arc', 'gsm8k', 'mmlu', 'cmmlu', 'ceval', 'bbh', 'general_qa'. If only custom datasets need to be evaluated, set this parameter to `no`.
- `--eval_few_shot`: The number of few-shot examples for each sub-dataset of each evaluation set. Defaults to `None`, which uses the dataset's default configuration.
- `--eval_limit`: The number of samples drawn from each sub-dataset of each evaluation set. Defaults to `None`, meaning full evaluation.
- `--name`: Distinguishes the result storage paths of runs with the same configuration. Defaults to the current timestamp.
- `--eval_url`: An OpenAI-compatible model endpoint, e.g. `http://127.0.0.1:8000/v1`. Set this when evaluating a deployed model; it is usually not needed. Defaults to `None`.
```shell
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_is_chat_model true --model_type gpt4 --eval_token xxx
```
- `--eval_token`: The token for the OpenAI-compatible endpoint. Defaults to `'EMPTY'`, meaning no token is required.
- `--eval_is_chat_model`: Required when `eval_url` is set, to indicate whether the endpoint serves a `chat` model; `False` means a `base` model. Defaults to `None`.
- `--custom_eval_config`: Evaluate with a custom dataset; must be the path to an existing local file. For the file format, see [Custom Evaluation Set](./LLM-eval.md#Custom-Evaluation-Set). Defaults to `None`.
- `--eval_use_cache`: Whether to reuse an evaluation cache that has already been generated, so finished evaluations are not rerun and only the report is regenerated. Defaults to `False`.

A custom-dataset-only invocation is sketched below.
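For the custom-dataset-only case described under `--eval_dataset` and `--custom_eval_config`, a minimal sketch (the config path is a hypothetical placeholder):

```shell
# Hypothetical config path; see the Custom Evaluation Set docs for the format.
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset no \
    --custom_eval_config custom_eval_config.json
```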

## app-ui Parameters

35 changes: 19 additions & 16 deletions docs/source_en/LLM/LLM-eval.md
@@ -59,29 +59,32 @@ pip install -e '.[eval]'

## Evaluation

Evaluation supports vLLM for acceleration. Here we demonstrate evaluating both the original model and the LoRA fine-tuned qwen2-7b-instruct.

```shell
# Original model (approximately half an hour on a single A100)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm

# After LoRA fine-tuning
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
    --merge_lora true
```

You can refer to [here](./Command-line-parameters.md#eval-parameters) for the list of evaluation parameters.

### Evaluation using the deployed method

```shell
# Start deployment using the OpenAI API method
CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen2-7b-instruct

# Evaluate using the API
# If it is not a Swift deployment, you need to additionally pass in `--eval_is_chat_model true --model_type qwen2-7b-instruct`.
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k

# The same applies to the model after LoRA fine-tuning (see the sketch below).
```
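For a LoRA fine-tuned model, a minimal sketch of the same flow (the checkpoint path is a placeholder, and passing `--merge_lora true` to `swift deploy` is an assumption based on the flags shown earlier):

```shell
# Assumed flow: deploy the merged LoRA checkpoint, then evaluate via the endpoint.
CUDA_VISIBLE_DEVICES=0 swift deploy --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
    --merge_lora true

# In another shell:
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k
```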

## Custom Evaluation Set