[SkyServe] Doc with vllm example #2922
Conversation
Thanks @cblmemo!
I tried out the vLLM example. It launched 2 replicas on Azure, but they hit the following errors during setup:
(task, pid=7557) INFO 01-01 18:01:44 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
tokenizer_config.json: 100%|██████████| 1.46k/1.46k [00:00<00:00, 650kB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 46.5MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 29.2MB/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 136kB/s]
...
(task, pid=7557) torch.distributed.all_reduce(torch.zeros(1).cuda())
(task, pid=7557) File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
(task, pid=7557) return func(*args, **kwargs)
(task, pid=7557) File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(task, pid=7557) work = group.allreduce([tensor], opts)
(task, pid=7557) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
(task, pid=7557) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(task, pid=7557) Last error:
(task, pid=7557) XML Import Channel : dev 2 not found.
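To surface the underlying failure, the traceback's own hint applies: run with NCCL_DEBUG=INFO. A minimal sketch of how that could be wired into the example's task YAML (the envs and run fields are standard SkyPilot task syntax, but the flags below are inferred from the llm_engine.py log line above, not copied from the PR's file):

```yaml
# Sketch only: enable NCCL diagnostics for the failing all_reduce.
envs:
  NCCL_DEBUG: INFO  # per the traceback's hint; prints the root-cause NCCL error

run: |
  conda activate vllm
  # Model and parallelism inferred from the log above, not from the PR's YAML.
  python -m vllm.entrypoints.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2 \
    --host 0.0.0.0 --port 8080
```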
Relatedly, since the 2 replicas failed, 2 new ones are getting started. IIUC this is due to auto_restart. However, does this restart behavior loop forever? (In managed spot, we do not restart user job failures by default.)
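For context, a hedged sketch of where that knob sits, assuming the service section of the example's YAML; the probe path and replica count are placeholders, and whether auto_restart should loop forever is exactly the open question:

```yaml
service:
  readiness_probe: /v1/models  # placeholder path, not from the PR's diff
  replicas: 2
  auto_restart: true           # the restart behavior being questioned above
```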
docs/source/serving/sky-serve.rst
Outdated
ports: 8080
accelerators: A100:1
- Is it possible to keep ports / accelerators the same across the two tabs? Fewer changes to read.
- accelerators: is it possible to use {A100, A10G, T4, L4} etc. with appropriate counts? Higher chance of getting provisioned, and it showcases an important capability. (A sketch follows below.)
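A sketch of what that could look like using SkyPilot's set syntax for accelerators; the counts here are illustrative guesses, not tuned for the Vicuna-13B example:

```yaml
resources:
  ports: 8080
  # Any one choice satisfies the request; the optimizer picks whatever is
  # available and cheapest, raising the odds of a successful provision.
  accelerators: {A100:1, A10G:1, T4:2, L4:2}
```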
Changed to a single A100 for the 13B Vicuna model.
Thanks. Wdyt about the second point?
Or, if we keep a single GPU choice, can we change A100 to something else? When trying out this example, provisioning failover takes a long time on GCP.
Now added more choices borrowed from https://github.com/skypilot-org/skypilot/blob/master/llm/mixtral/serve.yaml
This is fantastic @cblmemo! Just wondering if we can add the example to llm/vllm/README.md and also add the YAML in that folder?
Sure! Will do it after the doc is ready 🫡
docs/source/serving/sky-serve.rst
Outdated
@@ -62,7 +62,7 @@ Here is a simple example of serving an LLM model (:code:`lmsys/vicuna-13b-v1.5`)
   # Fields below describe each replica.
   resources:
     ports: 8080
-    accelerators: A100:1
+    accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
This picks Azure's A100-80GB:2 for me and results in the same error:
(task, pid=7600) File "/home/azureuser/miniconda3/envs/vllm/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
(task, pid=7600) work = group.allreduce([tensor], opts)
(task, pid=7600) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
(task, pid=7600) ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
(task, pid=7600) Last error:
(task, pid=7600) XML Import Channel : dev 2 not found.
Did you not encounter this error?
Is there a good way to ban Azure:
I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
I 01-05 14:13:07 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
I 01-05 14:13:07 optimizer.py:910] Azure Standard_NC48ads_A100_v4 48 440 A100-80GB:2 eastus 7.35 ✔
I 01-05 14:13:07 optimizer.py:910] GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98
I 01-05 14:13:07 optimizer.py:910] GCP a2-ultragpu-2g 24 340 A100-80GB:2 us-central1-a 10.06
I 01-05 14:13:07 optimizer.py:910] Azure Standard_NC96ads_A100_v4 96 880 A100-80GB:4 eastus 14.69
I 01-05 14:13:07 optimizer.py:910] GCP a2-highgpu-4g 48 340 A100:4 us-central1-a 14.69
I 01-05 14:13:07 optimizer.py:910] AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29
I 01-05 14:13:07 optimizer.py:910] GCP a2-ultragpu-4g 48 680 A100-80GB:4 us-central1-a 20.11
I 01-05 14:13:07 optimizer.py:910] Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20
I 01-05 14:13:07 optimizer.py:910] GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39
I 01-05 14:13:07 optimizer.py:910] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
I 01-05 14:13:07 optimizer.py:910] Azure Standard_ND96amsr_A100_v4 96 1924 A100-80GB:8 eastus 32.77
I 01-05 14:13:07 optimizer.py:910] GCP a2-ultragpu-8g 96 1360 A100-80GB:8 us-central1-a 40.22
I 01-05 14:13:07 optimizer.py:910] AWS p4de.24xlarge 96 1152 A100-80GB:8 us-east-1 40.97
I 01-05 14:13:07 optimizer.py:910] ----------------------------------------------------------------------------------------------------------
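This thread doesn't show a deny-list knob; one stopgap sketch is to pin the replica to a single cloud via the standard resources-level cloud field, at the cost of cross-cloud failover:

```yaml
resources:
  cloud: gcp  # pin away from Azure as a workaround, not a real deny-list
  ports: 8080
  accelerators: {L4:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
```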
I didn't encounter this error; my controller chose two L4:8 replicas.
LGTM, modulo the known issue on Azure #2905, which seems ok for now.
This PR adds a vLLM example to the SkyServe doc.
TODO: Make sure the example works.
Tested (run the relevant ones):
- bash format.sh
- sky serve up the example and curl to make sure it works
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_fill_in_the_name
- bash tests/backward_comaptibility_tests.sh