add a new API for multi-node/multi-gpu #3871
Conversation
/rerun-all
Does this mean there is no autoscaling for the set of multi-host serving (N x (host + worker))? I see the solution is trying to fit the host and worker (or the LWS) from vLLM multi-host serving into the KServe ServingRuntime. How about a native StatefulSet? (I saw it under the open questions in the proposal slides as well.) It should be possible to implement with a StatefulSet, using index 0 as the head node; would that be easier? (In that case, implementation-wise, we would just need to add an option in the ServingRuntime to deploy it as a StatefulSet or a Deployment.) My main concern is whether the head/worker setup for multi-host serving is finalized in vLLM, or whether it will switch to another implementation in the future. Just raising a point, feel free to share your thoughts.
@lizzzcai Great question. First, this proposal aims to add new APIs for multi-node/multi-GPU functionality. Your question focuses more on the implementation side, so let me share my thoughts on that. As you've understood, the current multi-node/multi-GPU functionality leverages the vLLM ServingRuntime, which uses Ray for orchestration. To use this feature in KServe, it's crucial to determine how the Ray cluster should be set up. There are various orchestration methods we could use, such as StatefulSet, LWS (LeaderWorkerSet), or KubeRay. Among these, we're currently considering the most Kubernetes-native approach, which is StatefulSet. However, I can't guarantee that we can use index 0 in the StatefulSet as the head node, because the Ray cluster commands differ between head and worker nodes. Managing both the head and worker nodes within a single StatefulSet could potentially increase complexity.
If a new runtime that doesn't rely on Ray clusters is introduced for multi-node/multi-GPU, additional logic might be required. Should LWS be adopted in the future, I think it would be beneficial to support it and allow users to choose the orchestration method that best suits their needs.
In the example, what is the replicas value as per the spec? How is this made compatible with the current serving runtime (single node)?
Basically, the original `replicas` in the spec works as usual. If `workerSpec` is specified, the `replicas` in the spec is ignored and always set to 1 for the Ray head pod. The `replicas` in `workerSpec` sets the worker node count.
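For illustration, a minimal sketch of an InferenceService using the fields as described at this point in the discussion (the name, model format, and storage URI are placeholders, not from this PR):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-multinode                  # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM                       # assumes a vLLM-based ServingRuntime
      storageUri: pvc://llm-models/llama # placeholder model location
    # spec-level replicas is ignored when workerSpec is present;
    # the head pod count is fixed at 1
    workerSpec:
      replicas: 2                        # number of Ray worker nodes
```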
/rerun-all
It is possible by passing the index as an env var and starting from a script that runs a different command based on it. I don't have a strong opinion on this.
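As a rough sketch of that idea, a container fragment from a single StatefulSet where ordinal 0 becomes the Ray head and all other ordinals join as workers (the image, env var names, and service DNS are made up for illustration; this is not from the PR):

```yaml
containers:
  - name: kserve-container
    image: example/vllm-ray:latest      # placeholder image
    command: ["/bin/sh", "-c"]
    args:
      - |
        ORDINAL="${POD_NAME##*-}"       # StatefulSet pod names end in their ordinal
        if [ "$ORDINAL" = "0" ]; then
          ray start --head --port=6379 --block
        else
          ray start --address="${HEAD_SERVICE}:6379" --block
        fi
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: HEAD_SERVICE
        value: llama-multinode-0.llama-multinode   # placeholder headless-service DNS
```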
Which means only one replica will be running. Is there a plan to support more than one head/worker set?
Is it possible to add a ...
I mean the Ray command should be different per head/worker node, which would be determined by the pod index.
This is a good point. For this, we would need the load balancer to support sticky sessions. I didn't consider this much at this stage because it is a more advanced feature. However, I like this idea. To summarize, `spec.replicas` implies the number of head/worker sets, and `spec.workerSpec.replicas` represents the number of worker nodes (the head node is always 1).
Are there cases where StatefulSets are necessary in RawDeployment mode? For example, in scenarios like vLLM multi-node/multi-GPU setups, where maintaining a consistent name is important, using StatefulSets makes sense. However, if there is no general reason to use StatefulSets in typical cases, is it necessary to have this option?
Hi @Jooho
Yes, supporting the basic scenario with replica 1 (one head/worker set) should be good enough as a starting point.
I have seen cases where users face disk-pressure issues when deploying a large model on a small node, so they want to attach a PVC. The `volumeClaimTemplates` in the StatefulSet seems to be an easy and quick way to create PVCs (e.g. EBS) for multiple replicas. For a Deployment, you would need ReadWriteMany storage to share a single PVC across multiple replicas, which needs some additional setup, like provisioning an EFS. It may not be the best solution here, but the flexibility to deploy as a Deployment or a StatefulSet can help to implement it.
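For reference, a sketch of the `volumeClaimTemplates` approach being described, as a fragment of a StatefulSet spec (the volume name and size are illustrative):

```yaml
# Each StatefulSet replica gets its own PVC, so a per-replica block
# volume (e.g. EBS) works without needing ReadWriteMany storage.
volumeClaimTemplates:
  - metadata:
      name: model-cache              # placeholder volume name
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi             # illustrative size for large model weights
```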
Hi @lizzzcai
I was unaware of this issue. Since I haven't been able to confirm exactly what the issue is, and I can't determine whether StatefulSet is the best solution, I think this should be addressed separately from this proposal. wdyt?
Hi @Jooho, yes, this use case can be addressed in another issue for further discussion. However, for this new API, do you assume the head and worker will be StatefulSets? In that case, is it assumed that when there is a `workerSpec`, the workload will be deployed as a StatefulSet?
Yes, that is the right assumption. By the way, I am thinking of using ...
/rerun-all
/rerun-all
@Jooho BTW, I'm confused about why `replicas` was removed from the `workerSpec`. Can you elaborate on it? I no longer see a way to right-size the number of workers.
@israel-hdez Sure. The replicas should be set automatically from the `pipeline-parallel-size` environment variable.
/rerun-all
Hmm... So, this makes sense, but only for vLLM. And it is weird IMO, because you'd need to inspect the container's environment variables to figure out the number of replicas.
I discussed with @israel-hdez and re-added the `size` field to `workerSpec`. Please review this @yuzisun
/rerun-all
```go
// Configure the number of replicas in the worker set; each worker set represents the unit of scaling.
// +optional
Size int `json:"size,omitempty"`
```
Shall we consider using the pipeline parallelism terminology, which is more familiar to ML engineers?
ML engineers still use the pipeline parallel size.

- If `pipeline-parallel-size` is set in the environment variables, that value will be used.
- If it isn't set and `size` is specified instead, `pipeline-parallel-size` will be assigned the value of `size`.
- If neither is specified, `pipeline-parallel-size` will get the default value of 2 (1 head / 1 worker).

Now, both options will be available for configuration.
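To illustrate the precedence being described, a spec fragment (the env var name, runtime name, and layout are assumptions for the sketch, not the final API):

```yaml
# The env var wins over the field; the resolved value drives both
# the worker node count and pipeline-parallel-size.
spec:
  predictor:
    model:
      runtime: vllm-multinode          # placeholder runtime name
      env:
        - name: PIPELINE_PARALLEL_SIZE # illustrative env var name
          value: "3"                   # resolves to 1 head + 2 workers
    workerSpec:
      size: 1                          # would mean 1 worker, but the env var takes precedence here
```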
How about renaming `size` to `pipelineParallelismSize`? Are there any benefits to having both the field and the environment variable? It would also be easier to validate fields for pipeline parallelism.
I find that `pipelineParallelismSize` can be somewhat confusing in the context of WorkerSpec.

> Are there any benefits to having both the field and the environment variable?

- For MLOps engineers, specifying `size` in WorkerSpec seems like a straightforward approach to increasing the number of worker nodes.
- On the other hand, Data Scientists might find it more intuitive to use `pipeline-parallel-size`.

> It would also be easier to validate fields for pipeline parallelism.

In the end, the validation comes down to checking whether the worker node `size` + 1 (head node) and `pipeline-parallel-size` are the same or different. As I explained above, the size is determined according to the priority of each field, and that value is set to both the worker node `size` and `pipeline-parallel-size`, so I think the validation concern is resolved.
I find the naming closely related to a chicken-and-egg problem:

- By naming it `size`, our naming will be aligned to LeaderWorkerSet.
- By naming it `pipelineParallelismSize`, it is going to be more familiar for people using vLLM.

I, personally, like the `size` naming more because it is more related to infrastructure config (which is closer to KServe's concern) than to the technique used for distributing the load.
Pipeline parallelism is not just a vLLM concept; it is common LLM inference terminology, and KServe is in a good position to standardize these concepts instead of making them second-class citizen fields.
https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
I am not sure if following LWS is a strong argument as we are solving specific inference problems here.
Sorry about the strong opinion, but I really want KServe to focus on the LLM inference specification and standardize these concepts for GenAI.
Discussed with @Jooho; we will make the proposed change in a separate PR. Good to approve this one.
> Sorry about the strong opinion, but I really want KServe to focus on the LLM inference specification and standardize these concepts for GenAI.

Just to contribute to the discussion... Well, the way I've understood the `ServingRuntime` CRD is that it is a generic CRD for configuring any runtime you'd like to use. This was my motivation for preferring the more generic `size` naming. Under this understanding, using `pipelineParallelismSize` didn't make sense to me, as we would anyway need the runtime to support it and, AFAIK, KServe would still be limited to passing this value down to the runtime container (e.g. through env vars); it's the runtime that ends up applying the config (or ignoring it).

Now... I'm not against the `pipelineParallelismSize` naming, but if we use it, my thought is that KServe would be communicating a closer integration with the model servers, as not every server implements pipeline parallelism. Maybe it is better to introduce a specialized CRD for LLM inferencing? We've only discussed pipeline parallelism size here, but we could easily add tensor parallelism size to this discussion, which would also be important (e.g. 2 GPUs on each worker).
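Just to make that idea concrete, a purely hypothetical sketch of what such a specialized spec might expose; none of these fields exist in KServe:

```yaml
# Hypothetical LLM-focused spec combining both parallelism dimensions.
spec:
  predictor:
    model:
      runtime: vllm-multinode     # placeholder runtime name
    parallelism:                  # hypothetical field group
      pipelineParallelSize: 2     # model stages split across nodes
      tensorParallelSize: 2       # GPUs per node sharing each stage
```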
LGTM
@yuzisun Could you please approve this PR?
/approve
What this PR does / why we need it:
Motivation for Multi-Node/Multi-GPU Support for Inference
As models continue to grow in size, it has become increasingly challenging to fit these large models into the memory of a single GPU. However, they can often be accommodated within the combined memory of multiple GPUs. Existing techniques such as tensor parallelism and pipeline parallelism allow for the division of models, enabling them to run in parallel across multiple Nodes/GPUs, which significantly enhances performance.
This feature is already supported natively in the vLLM ServingRuntime by leveraging Ray Cluster for multi-GPU and multi-node deployments.
Prerequisites for Using This Feature
Before utilizing this feature, there are several important considerations:
- Shared Model Deployment
- Auto-scaling for Multi-Node/GPU Configuration
- Supported protocols
- Head node replicas must always be 1
API Additions Required in KServe
To support this feature, the following CRD (Custom Resource Definition) changes are necessary within KServe:
Manifest Examples
ServingRuntime
InferenceService
workerSpec can be set in the InferenceService (isvc) as well.
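A rough sketch of the two manifests under the API described above (names, images, and model URI are placeholders; the exact schema may differ):

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-multinode                 # placeholder name
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container           # head node container
      image: example/vllm-ray:latest   # placeholder image
  workerSpec:                          # new field from this proposal
    containers:
      - name: worker-container
        image: example/vllm-ray:latest # placeholder image
```

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-multinode                # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: pvc://llm-models/llama   # placeholder model location
    workerSpec:
      size: 1                          # one worker node (head node is always 1)
```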
References:
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #3870
Re-running failed tests
- `/rerun-all` - rerun all failed workflows.
- `/rerun-workflow <workflow name>` - rerun a specific failed workflow. Only one workflow name can be specified. Multiple `/rerun-workflow` commands are allowed per comment.