
Qwen-72B-Chat-Int4 inference time #783

Closed
2 tasks done
zhudongwork opened this issue Dec 12, 2023 · 8 comments

Comments

@zhudongwork

zhudongwork commented Dec 12, 2023

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

Inference with Qwen-72B-Chat-Int4 on 2x A100 40G takes as long as 257 s. Is this normal?

Expected Behavior

No response

Steps To Reproduce

import time
from modelscope import AutoTokenizer, AutoModelForCausalLM

# Note: The default behavior now has injection attack prevention off.
tokenizer = AutoTokenizer.from_pretrained("/opt/models/model_repository/Qwen-72B-Chat-Int4", revision='master', trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    "/opt/models/model_repository/Qwen-72B-Chat-Int4", revision='master',
    device_map="auto",
    trust_remote_code=True
).eval()
start = time.time()
response, history = model.chat(tokenizer, "讲一个小故事", history=None)  # prompt: "tell a short story"
end = time.time()
print(response)
print("infer time:", end-start)

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

@jklj077
Contributor

jklj077 commented Dec 12, 2023

How long is the output? Please compare against the speed numbers in the README: https://github.com/QwenLM/Qwen#inference-performance (also keep in mind that with transformers inference, multi-GPU is slower than single-GPU).
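
For reference, a rough way to turn the snippet from the issue into a tokens-per-second figure that can be compared against the README table (a sketch; it re-tokenizes the returned text, so the count only approximates what model.chat() actually generated):

import time
from modelscope import AutoTokenizer, AutoModelForCausalLM

# Same local checkpoint path as in the report above.
MODEL_DIR = "/opt/models/model_repository/Qwen-72B-Chat-Int4"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, revision="master", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, revision="master", device_map="auto", trust_remote_code=True
).eval()

start = time.time()
response, _ = model.chat(tokenizer, "讲一个小故事", history=None)
elapsed = time.time() - start

# Re-tokenize the response to estimate how many tokens were generated.
n_tokens = len(tokenizer.encode(response))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.2f} tokens/s")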

@zhudongwork
Author

zhudongwork commented Dec 12, 2023

The output is not very long either:

On a cold winter day, Xiao Ming was walking home when he saw a little bird that had fallen to the ground, shivering from the cold. Feeling sorry for it, he cupped the bird in his hands and warmed it with his own body heat.
After a while, the bird gradually regained its strength. It gave Xiao Ming a grateful look and flew up into the sky.
Xiao Ming felt very happy because he had done a good deed. He knew that even a small act can bring some warmth and kindness to the world.
From that day on, Xiao Ming cared even more for nature and for the lives around him. His kindness and compassion touched many people and made him more confident and happy.
This little story tells us that wherever we are, we should stay kind and caring. Only then will our world become a better place.

About 153 tokens, i.e. roughly 0.6 tokens/s for the 257 s run above.

@boquanzhou

Ran into the same problem: on 4x V100, asking it to tell a story took more than 10 minutes... How can this be fixed?

@BUJIDAOVS

Same here: with the 72B-Int4 model deployed via Docker, inference is very slow on both single-GPU and dual-GPU setups.

@sheiy

sheiy commented Dec 19, 2023

Docker deployment is much slower than running directly on the bare-metal host.
Docker inference took 46 s.
Bare metal took 2 s.

Bare-metal environment:

5x V100 (16G)
Python 3.10.13
NVIDIA-SMI 535.129.03
Driver Version: 535.129.03
CUDA Version: 12.2
PyTorch Version: 2.1.2

@fyabc
Contributor

fyabc commented Dec 25, 2023

@sheiy @zhudongwork @BUJIDAOVS @boquanzhou Hi, if you are deploying the quantized 72B model in Docker, the slow inference is caused by a problematic auto-gptq version in the earlier Docker image (see the linked issue).
The latest Docker image has fixed this; please pull the latest image and try again.
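
Before pulling the new image, one way to confirm which build a container is actually running is a minimal version check (a sketch; it only assumes the packages are installed under their usual pip names):

# Print versions of the packages most relevant to GPTQ inference speed,
# so the Docker environment can be compared against the fast bare-metal one.
from importlib.metadata import version, PackageNotFoundError
import torch

for pkg in ("auto-gptq", "optimum", "transformers"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")

print("torch", torch.__version__, "cuda", torch.version.cuda)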

fyabc added a commit that referenced this issue Dec 25, 2023
@fyabc fyabc mentioned this issue Dec 25, 2023
jklj077 pushed a commit that referenced this issue Dec 25, 2023
@sheiy

sheiy commented Dec 25, 2023

@fyabc Thanks!

@jklj077 jklj077 closed this as completed Jan 2, 2024
@terence-wu

Inference is still very slow even on a single A100 80G, and the config's maximum memory limit has already been raised to 60G.
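
For reference, with device_map="auto" a per-GPU memory cap is usually expressed as a max_memory dict passed to from_pretrained; a minimal sketch, assuming the same local path as earlier in the thread and a single visible GPU:

from modelscope import AutoModelForCausalLM

# Local checkpoint path reused from the original report (adjust as needed).
MODEL_DIR = "/opt/models/model_repository/Qwen-72B-Chat-Int4"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    revision="master",
    device_map="auto",
    max_memory={0: "60GiB"},  # cap GPU 0 at 60 GiB, matching the limit mentioned above
    trust_remote_code=True,
).eval()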
