
Error with VLLM #1136

Closed
didoll-john opened this issue Jun 12, 2024 · 5 comments

Comments

@didoll-john

I started vLLM via Docker with this command:
docker run --runtime nvidia --gpus all -d --restart always -v ~/data/.cache/huggingface:/root/.cache/huggingface -v /data/LLM_models/Qwen/Qwen2-72B-Instruct-GPTQ-Int4:/data/Qwen2-72B-Instruct-GPTQ-Int4 -p 8000:8000 --ipc=host vllm/vllm-openai:latest --served-model-name Qwen2-72B-Instruct-GPTQ-Int4 --model /data/Qwen2-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 4
Then I ran a curl test:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "Qwen2-72B-Instruct-GPTQ-Int4", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "give me the answer of 1+1"} ] }'
It works well:
{"id":"cmpl-e44f7f809f5b4ebc82eff1e96c55ad1b","object":"chat.completion","created":1718184574,"model":"Qwen2-72B-Instruct-GPTQ-Int4","choices":[{"index":0,"message":{"role":"assistant","content":"The answer to 1 + 1 is 2.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":28,"total_tokens":41,"completion_tokens":13}}
Then I tried to use it in DSPy like this:
vllm_qwen = dspy.HFClientVLLM(model="Qwen2-72B-Instruct-GPTQ-Int4", port=8000, url="http://localhost")
and got this error:
Failed to parse JSON response: {"object":"error","message":"The model Qwen2-72B-Instruct-GPTQ-Int4 does not exist.","type":"NotFoundError","param":null,"code":404}
Then I tried it like this instead:
vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1", api_key='EMPTY')
and got another error:
openai.NotFoundError: Error code: 404 - {'detail': 'Not Found'}
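
A quick sanity check that might help here is listing the model names the server has actually registered; a minimal sketch with requests against /v1/models (the standard model-listing route of the vLLM OpenAI-compatible server):

import requests

# List the model names registered by the vLLM OpenAI-compatible server.
models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])
# Given --served-model-name above, this should include "Qwen2-72B-Instruct-GPTQ-Int4".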

@tom-doerr
Contributor

Could it make sense to look at the container logs? docker logs <container_id>

@didoll-john
Author

didoll-john commented Jun 12, 2024

INFO 06-12 12:36:44 api_server.py:177] vLLM API server version 0.5.0
INFO 06-12 12:36:44 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['Qwen2-72B-Instruct-GPTQ-Int4'], qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 06-12 12:36:44 gptq_marlin.py:133] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-06-12 12:36:47,341 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-12 12:36:48 config.py:623] Defaulting to use mp for distributed inference
INFO 06-12 12:36:48 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='/data/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Qwen2-72B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4701) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4699) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4700) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4699) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=4701) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=4700) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
INFO 06-12 12:39:58 distributed_gpu_executor.py:56] # GPU blocks: 6049, # CPU blocks: 3276
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 26 secs.
INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-12 12:40:30 serving_chat.py:92] Using default chat template:
INFO 06-12 12:40:30 serving_chat.py:92] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 06-12 12:40:30 serving_chat.py:92] You are a helpful assistant.<|im_end|>
INFO 06-12 12:40:30 serving_chat.py:92] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 06-12 12:40:30 serving_chat.py:92] ' + message['content'] + '<|im_end|>' + '
INFO 06-12 12:40:30 serving_chat.py:92] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 06-12 12:40:30 serving_chat.py:92] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-12 12:40:31 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 06-12 12:40:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[the same idle metrics line repeats every 10 seconds through 12:44:01]

The above is the log from when I use:
vllm_qwen = dspy.HFClientVLLM(model="Qwen2-72B-Instruct-GPTQ-Int4", port=8000, url="http://localhost")
and get the error:
Failed to parse JSON response: {"object":"error","message":"The model Qwen2-72B-Instruct-GPTQ-Int4 does not exist.","type":"NotFoundError","param":null,"code":404}

You can see that there is no log entry at all for this request.
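
One way to narrow this down would be to send a raw request to the server from the same machine DSPy runs on and see whether it shows up in the log; a minimal sketch with requests against /v1/completions (the standard OpenAI-compatible route, not necessarily the exact path HFClientVLLM builds internally):

import requests

# Direct request to the OpenAI-compatible completions endpoint, bypassing DSPy.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen2-72B-Instruct-GPTQ-Int4",
        "prompt": "1 + 1 =",
        "max_tokens": 8,
    },
)
print(resp.status_code, resp.text)
# If this succeeds but HFClientVLLM still returns 404, the client is likely hitting
# a different path or host than expected.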

@didoll-john
Author

I found that I can use it through the OpenAI API now; the key point is:
vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1/", api_key='EMPTY')
There must be a trailing "/" after "v1".

@fivejjs
Contributor

fivejjs commented Jun 13, 2024

I will try the OpenAI API with vLLM.

@fivejjs
Contributor

fivejjs commented Jun 14, 2024

I found that I can use it through the OpenAI API now; the key point is: vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1/", api_key='EMPTY') There must be a trailing "/" after "v1".

The solution also worked for Hugging Face models; the examples could be updated accordingly.
