
Error with VLLM #1136

Closed
didoll-john opened this issue Jun 12, 2024 · 5 comments

Comments

@didoll-john

I started vLLM via Docker with this command:
docker run --runtime nvidia --gpus all -d --restart always -v ~/data/.cache/huggingface:/root/.cache/huggingface -v /data/LLM_models/Qwen/Qwen2-72B-Instruct-GPTQ-Int4:/data/Qwen2-72B-Instruct-GPTQ-Int4 -p 8000:8000 --ipc=host vllm/vllm-openai:latest --served-model-name Qwen2-72B-Instruct-GPTQ-Int4 --model /data/Qwen2-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 4
Then I ran a curl test:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "Qwen2-72B-Instruct-GPTQ-Int4", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "give me the answer of 1+1"} ] }'
It works well:
{"id":"cmpl-e44f7f809f5b4ebc82eff1e96c55ad1b","object":"chat.completion","created":1718184574,"model":"Qwen2-72B-Instruct-GPTQ-Int4","choices":[{"index":0,"message":{"role":"assistant","content":"The answer to 1 + 1 is 2.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":28,"total_tokens":41,"completion_tokens":13}}
Then I tried to use it in DSPy like this:
vllm_qwen = dspy.HFClientVLLM(model="Qwen2-72B-Instruct-GPTQ-Int4", port=8000, url="http://localhost")
and got this error:
Failed to parse JSON response: {"object":"error","message":"The model Qwen2-72B-Instruct-GPTQ-Int4 does not exist.","type":"NotFoundError","param":null,"code":404}
Then I tried it like this instead:
vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1", api_key='EMPTY')
and got another error:
openai.NotFoundError: Error code: 404 - {'detail': 'Not Found'}
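
A quick sanity check that might help here is listing the model names the server has actually registered; a minimal sketch with requests against /v1/models (the standard model-listing route of the vLLM OpenAI-compatible server):

import requests

# List the model names registered by the vLLM OpenAI-compatible server.
models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])
# Given --served-model-name above, this should include "Qwen2-72B-Instruct-GPTQ-Int4".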

@tom-doerr
Contributor

Could it make sense to look at the container logs? docker logs <container_id>

@didoll-john
Author

didoll-john commented Jun 12, 2024

INFO 06-12 12:36:44 api_server.py:177] vLLM API server version 0.5.0
INFO 06-12 12:36:44 api_server.py:178] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/Qwen2-72B-Instruct-GPTQ-Int4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=['Qwen2-72B-Instruct-GPTQ-Int4'], qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 06-12 12:36:44 gptq_marlin.py:133] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-06-12 12:36:47,341 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-12 12:36:48 config.py:623] Defaulting to use mp for distributed inference
INFO 06-12 12:36:48 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='/data/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Qwen2-72B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4700) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=4701) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
INFO 06-12 12:36:59 utils.py:623] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=4699) INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
INFO 06-12 12:36:59 pynccl.py:65] vLLM is using nccl==2.20.5
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
cache[rtype].remove(name)
KeyError: '/psm_e421e9cd'
WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4701) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4699) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4700) WARNING 06-12 12:37:00 custom_all_reduce.py:170] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=4699) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=4701) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=4700) INFO 06-12 12:39:20 model_runner.py:159] Loading model weights took 9.7410 GB
INFO 06-12 12:39:58 distributed_gpu_executor.py:56] # GPU blocks: 6049, # CPU blocks: 3276
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:03 model_runner.py:878] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:03 model_runner.py:882] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
(VllmWorkerProcess pid=4699) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
(VllmWorkerProcess pid=4700) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
(VllmWorkerProcess pid=4701) INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 26 secs.
INFO 06-12 12:40:30 model_runner.py:954] Graph capturing finished in 27 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-12 12:40:30 serving_chat.py:92] Using default chat template:
INFO 06-12 12:40:30 serving_chat.py:92] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 06-12 12:40:30 serving_chat.py:92] You are a helpful assistant.<|im_end|>
INFO 06-12 12:40:30 serving_chat.py:92] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 06-12 12:40:30 serving_chat.py:92] ' + message['content'] + '<|im_end|>' + '
INFO 06-12 12:40:30 serving_chat.py:92] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 06-12 12:40:30 serving_chat.py:92] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 06-12 12:40:31 serving_embedding.py:141] embedding_mode is False. Embedding API will not work.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 06-12 12:40:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
[the same idle metrics line repeats every 10 seconds through 12:44:01]

The above is the log from when I use:
vllm_qwen = dspy.HFClientVLLM(model="Qwen2-72B-Instruct-GPTQ-Int4", port=8000, url="http://localhost")
and get the error:
Failed to parse JSON response: {"object":"error","message":"The model Qwen2-72B-Instruct-GPTQ-Int4 does not exist.","type":"NotFoundError","param":null,"code":404}

You can see that there is no log entry at all for this request.
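
One way to narrow this down would be to send a raw request to the server from the same machine DSPy runs on and see whether it shows up in the log; a minimal sketch with requests against /v1/completions (the standard OpenAI-compatible route, not necessarily the exact path HFClientVLLM builds internally):

import requests

# Direct request to the OpenAI-compatible completions endpoint, bypassing DSPy.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen2-72B-Instruct-GPTQ-Int4",
        "prompt": "1 + 1 =",
        "max_tokens": 8,
    },
)
print(resp.status_code, resp.text)
# If this succeeds but HFClientVLLM still returns 404, the client is likely hitting
# a different path or host than expected.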

@didoll-john
Author

I found that I can use it through the OpenAI API now; the key point is:
vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1/", api_key='EMPTY')
There must be a trailing "/" after "v1".

@fivejjs
Contributor

fivejjs commented Jun 13, 2024

I will try the OpenAI API with vLLM.

@fivejjs
Contributor

fivejjs commented Jun 14, 2024

I found that I can use it through the OpenAI API now; the key point is: vllm_qwen = dspy.OpenAI(model="Qwen2-72B-Instruct-GPTQ-Int4", api_base="http://localhost:8000/v1/", api_key='EMPTY') There must be a trailing "/" after "v1".

The solution also worked for Hugging Face models; the examples could be updated accordingly.
