
add a new API for multi-node/multi-gpu #3871

Merged

merged 17 commits into kserve:master from add_new_api_for_multi_node on Oct 3, 2024

Conversation

Jooho
Contributor

@Jooho Jooho commented Aug 19, 2024

What this PR does / why we need it:

Motivation for Multi-NODE/Multi-GPU Support for Inference

As models continue to grow in size, it has become increasingly challenging to fit these large models into the memory of a single GPU. However, they can often be accommodated within the combined memory of multiple GPUs. Existing techniques such as tensor parallelism and pipeline parallelism allow for the division of models, enabling them to run in parallel across multiple Nodes/GPUs, which significantly enhances performance.

This feature is already supported natively in the vLLM ServingRuntime by leveraging Ray Cluster for multi-GPU and multi-node deployments.

Prerequisites for Using This Feature

Before utilizing this feature, there are several important considerations:

Shared Model Deployment

  • Since the same model needs to be deployed across multiple nodes, it is essential to share the model via Persistent Volume Claims (PVCs).
  • The PVC must be attachable to multiple pods simultaneously, necessitating the use of file storage solutions like EFS (Elastic File System) or NFS (Network File System).
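
For illustration, a minimal RWX-capable PVC of the kind described above; the storage class name and size are placeholders that depend on the cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-3-8b-pvc
spec:
  accessModes:
    - ReadWriteMany            # must be mountable by the head pod and every worker pod
  storageClassName: efs-sc     # placeholder: any RWX-capable class (EFS, NFS, ...) works
  resources:
    requests:
      storage: 100Gi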

Auto-scaling for Multi-Node/GPU Configuration

  • Unlike simple HPA-based auto-scaling, the scaling strategy for this feature must consider Ray Cluster's autoscaling capabilities and specific parameters (tensor parallelism and pipeline parallelism) to choose the appropriate number of nodes and GPUs.
  • Even with all requirements met, there will be a temporary service disruption while the model's parameters or layers are redistributed during autoscaling.

Supported storage protocols

  • Since the same model needs to be loaded across multiple nodes, using PVCs is the most efficient approach and should be prioritized. The new ModelCache feature is currently under development; if ModelCache can download and store models directly in PVCs, it will significantly simplify preparing models in PVCs for multi-node/multi-GPU serving.

Head node replicas must always be 1

  • A Ray cluster only allows one head node, so the ServingRuntime replicas must always be 1.
  • The replicas in WorkerSpec will be the worker node count.

API Additions Required in KServe

To support this feature, the following CRD (Custom Resource Definition) changes are necessary within KServe:

  • ServingRuntime
    • WorkerSpec:
      • Incorporates ServingRuntimePodSpec and Replicas to define worker configurations.
  • InferenceService
    • WorkerSpec:
      • Integrates the WorkerSpec from ServingRuntime so that inference services can be configured accordingly.

Manifest Examples

ServingRuntime

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-huggingfaceserver
spec:
  ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A10G
  tolerations:
    - key: multi-node-inference
      operator: Equal
      value: 'true'
      effect: NoSchedule    
  containers:
    - name: kserve-container
      image: kserve/vllm:latest
      command: ["bash", "-c"]
      args:
      - |
        ray start --head --node-ip-address ${POD_IP}
      env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
  workerSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
              - NVIDIA-A10G
    tolerations:
      - key: multi-node-inference
        operator: Equal
        value: 'true'
        effect: NoSchedule                  
    containers:
    - name: kserve-container
      image: kserve/vllm:latest
      command: ["bash", "-c"]
      args:
      - |
        ray start --address="${HEAD_POD_NAME}.${POD_NAMESPACE}.svc.cluster.local:6379" --node-ip-address ${POD_NAME}.${POD_NAMESPACE}.svc.cluster.local; 
      env: 
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name          
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace              
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"     

InferenceService

kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model-name=/models/hf/8b_instruction_tuned
        - --model_id=meta-llama/Meta-Llama-3-8B-Instruct
        - --tensor-parallel-size=2
        - --pipeline-parallel-size=2        
      storageUri: "pvc:https://llama-3-8b-pvc"
      resources:
        limits:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"
        requests:
          cpu: "6"
          memory: 24Gi
          nvidia.com/gpu: "1"

workerSpec can be set in the InferenceService (isvc) as well.
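
For example, here is a sketch assuming workerSpec sits under predictor and mirrors the ServingRuntime workerSpec above; the exact fields (including the worker replica count, which is revisited later in this thread) are assumptions of this proposal rather than final API:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "pvc://llama-3-8b-pvc"
    workerSpec:              # assumed placement, mirroring the ServingRuntime workerSpec above
      replicas: 1            # number of Ray worker pods; the head pod is always a single replica
      containers:
        - name: kserve-container
          resources:
            limits:
              nvidia.com/gpu: "1"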

References:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #3870

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A

  • Test B

  • Logs

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:


Re-running failed tests

  • /rerun-all - rerun all failed workflows.
  • /rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

@Jooho Jooho marked this pull request as draft August 19, 2024 08:55

@Jooho Jooho force-pushed the add_new_api_for_multi_node branch from 982d6de to 06dba82 on August 20, 2024 03:42
@Jooho
Contributor Author

Jooho commented Aug 20, 2024

/rerun-all

@Jooho Jooho force-pushed the add_new_api_for_multi_node branch 5 times, most recently from 0d6118b to 10de564 on August 20, 2024 08:42
Signed-off-by: jooho lee <[email protected]>
@lizzzcai
Member

Head node replicas must be always 1

Ray cluster only allow 1 head node so ServingRuntime replicas has to be 1 all the time.
replicas in WorkerSpec will be the worker node size.

Does this mean there is no autoscaling for the set of multi-host serving (N x (head + worker))?

I see that the solution is trying to fit the head and worker (or the LWS) pattern from vLLM multi-host serving into the KServe ServingRuntime. How about a native StatefulSet? (I saw it under the open questions in the proposal slides as well.) It is possible to implement this with a StatefulSet and treat index 0 as the head node; would that be easier? (In this case, implementation-wise, we would just need to add an option in the ServingRuntime to deploy it as a StatefulSet or a Deployment.)

My main concern is whether the head and worker setup for multi-host serving is finalized in vLLM or whether it will switch to another implementation in the future.

Just to raise a point, feel free to give your thoughts.

@Jooho
Contributor Author

Jooho commented Aug 22, 2024

@lizzzcai Great question.

First, this proposal aims to add new APIs for multi-node/multi-GPU functionality. Your question focuses more on the implementation side, so let me share my thoughts on that.

As you’ve understood, the current multi-node/multi-GPU functionality leverages the vLLM ServingRuntime, which uses Ray for orchestration. To utilize this feature in KServe, it’s crucial to determine how the Ray cluster should be set up. There are various orchestration methods we could use, such as StatefulSet, LWS (LeaderWorkerSet), or KubeRay. Among these, we're currently considering the most Kubernetes-native approach, which is StatefulSet.

However, I can't guarantee that we can use index 0 in the StatefulSet as the head node because the Ray cluster commands differ. Managing both the head and worker nodes within a single StatefulSet could potentially increase complexity.

My main concern is whether the head and worker setup for multi-host serving is finalized in vLLM or if it will switch to another implementation in the future.

If a new runtime that doesn’t rely on Ray clusters is introduced for multi-node/multi-GPU, additional logic might be required. Should LWS be adopted in the future, I think it would be beneficial to support it and allow users to choose the orchestration method that best suits their needs.

@johnugeorge
Contributor

In the example, what is the replicas value as per the spec? How is this made compatible with the current serving runtime (single node)?

@Jooho
Contributor Author

Jooho commented Aug 27, 2024

In the example, what is the replicas value as per the spec? How is this made compatible with the current serving runtime (single node)?

Basically, the original replicas in the spec works as usual. If workerSpec is specified, the replicas in the spec will be ignored and always set to 1 for the Ray head pod. The replicas in workerSpec will be used as the worker node count.
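
To make the semantics concrete, a short annotated fragment (not a complete manifest) using the replicas fields as discussed here; their exact placement in the CRD is part of the design, not final API:

spec:
  replicas: 3        # ignored once workerSpec is set; the Ray head pod count is forced to 1
  workerSpec:
    replicas: 2      # worker pod count; total Ray nodes = 1 head + 2 workers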

@Jooho Jooho changed the title [Draft] add a new API for multi-node/multi-gpu add a new API for multi-node/multi-gpu Aug 28, 2024
@Jooho Jooho marked this pull request as ready for review August 28, 2024 02:11
@Jooho
Contributor Author

Jooho commented Aug 28, 2024

/rerun-all

@lizzzcai
Member

However, I can't guarantee that we can use index 0 in the StatefulSet as the head node because the Ray cluster commands differ. Managing both the head and worker nodes within a single StatefulSet could potentially increase complexity.

It is possible by passing the index as an environment variable and starting from a script that runs a different command based on that env. I don't have a strong opinion on this.

basically, the original replicas in the spec is working as usual. If the workerSpec is specified, the replicas in the spec will be ignored and set 1 always for the ray header pod. The replicas in workerSpec will be set for worker node count.

Which means only one replica will be running. Is there a plan to support a number of replicas of the head/worker set? That would kind of align with the existing replicas concept.

Among these, we're currently considering using the most Kubernetes-native approach, which is StatefulSet.

Is it possible to add a type (or some other good name) in the ServingRuntime spec and worker spec to define which k8s resource will be deployed, for example Deployment or StatefulSet? I think it would provide some flexibility for raw deployment mode.

@Jooho
Contributor Author

Jooho commented Aug 30, 2024

It is possible by passing the index as env and start from a script to run different command based on the env. I don't have a strong opinion on this.

I mean that the ray command differs between the head and worker nodes, and it is set via command or args. If we want to go with a single StatefulSet, we need a script to manage both, but in that case the command could not be modified in the ServingRuntime or InferenceService.

Which means only one replica will be running. Is there a plan to support num of replicas x head worker set? which kind of align with existing replicas concept.

This is a good point. For this, we would need a load balancer that supports sticky sessions. I didn't consider this much at this stage because it is a more advanced feature. However, I like this idea. To summarize, spec.replicas would imply the number of head/worker sets and spec.workerSpec.replicas the number of worker nodes (the head node is always 1).
However, this may be something to address if requirements arise in the future. Generally, deploying models with such extensive resource usage does not seem very common. For now, it makes sense to support the most widely used solutions and consider additional development for sticky sessions only if the need arises later. What do you think?

Is it possible to add a type (or other good naming) in servingruntime spec and worker spec to define which k8s resources will be deployed? for example, deployment or statefulset. I think it will provide some flexibility for raw deployment mode.

Are there cases where StatefulSets are necessary in RawDeployment mode? For example, in scenarios like vLLM multi-node/multi-GPU setups, where maintaining a consistent name is important, using StatefulSets makes sense. However, if there is no general reason to use StatefulSets in typical cases, is it necessary to have this type field in the API?

@lizzzcai
Member

Hi @Jooho

However, this may be something to address if requirements arise in the future. Generally, it seems that deploying models with such extensive resource usage is not very common. For now, it makes sense to support the most widely used solutions, and consider additional development for Sticky Sessions only if the need arises later. What do you think?

Yes, supporting the basic scenario with replicas of 1 (one head/worker set) should be good enough as a starting point.

Are there cases where StatefulSets are necessary in RawDeployment mode? For example, in scenarios like VLLM multi-node/multi-GPU setups, where maintaining a consistent name is important, using StatefulSets makes sense. However, if there is no general reason to use StatefulSets in typical cases, is it necessary to have this type API?

I have seen cases where a user faces disk-pressure issues when deploying a large model on a small node, so they want to attach a PVC. The volumeClaimTemplates field in a StatefulSet seems to be an easy and quick way to create PVCs (e.g. EBS) for multiple replicas. For a Deployment, you would need ReadWriteMany storage to share a single PVC across multiple replicas, which needs some additional setup, like provisioning an EFS. It may not be the best solution here, but the flexibility to deploy as either a Deployment or a StatefulSet could help implement it.
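
For context, here is the kind of volumeClaimTemplates fragment being referred to (a StatefulSet field; names and sizes are illustrative); one PVC is created per replica:

  # fragment of a StatefulSet spec
  volumeClaimTemplates:
    - metadata:
        name: model-storage
      spec:
        accessModes:
          - ReadWriteOnce      # block storage such as EBS; each replica gets its own claim
        resources:
          requests:
            storage: 100Gi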

@Jooho
Contributor Author

Jooho commented Sep 3, 2024

hi @lizzzcai

I saw some cases that user faces disk pressure issues in a small node to deploy a large model., so they want to attach a PVC. The volumeClaimTemplates in the statefulset seems to be an easy and quick solution to create PVCs (EBS) for multiple replicas. For deployment, you will need RW-Many storage to share a single PVC to multiple deployment replicas, which need some additional setup, like provisioning an EFS. It may not be the best solution here, with the flexibility to deploy as deployment or statefulset can help to implement it.

I was unaware of this issue. Since I haven't been able to confirm exactly what the issue is, and I can't determine whether StatefulSet is the best solution, I think this should be addressed separately from this proposal. wdyt?

@lizzzcai
Member

lizzzcai commented Sep 4, 2024

Hi @Jooho , yes this use case can be addressed in another issue for further discussion.

However, for this new API, do you assume the head and worker will be StatefulSets? In that case, is it assumed that when a workerSpec is provided, the workload will be created as a StatefulSet? Another reason I bring up the type field is whether you want to set it explicitly.

@Jooho
Contributor Author

Jooho commented Sep 4, 2024

Hi @Jooho , yes this use case can be addressed in another issue for further discussion.

However for this new API, do you assume the head and worker will be statefulset? in this case, is it assumed that when there is a workerSpec provided, the workload will be created as statefulset? another reason I bring up the type field is whether you want to set it explicitly.

Yes, that is the right assumption.
In the first phase, I plan to use StatefulSets when workerSpec is specified and Deployments for everything else. I think it's better to change small parts one by one rather than adding this type all at once to switch between Deployment and StatefulSet; that should be done in a separate GitHub issue later. If there is no problem supporting vLLM multi-node/multi-GPU through StatefulSets, I think adding a type in the future will not be difficult.

By the way, I am thinking of using the inferenceservice-config ConfigMap to set the type, something like this:

 deploy: |-
    {
      "defaultDeploymentMode": "Serverless",
      "defaultPodManagementPolicy": "statefulSet"
    }

@Jooho
Contributor Author

Jooho commented Sep 20, 2024

/rerun-all

@Jooho
Contributor Author

Jooho commented Sep 20, 2024

/rerun-all

@Jooho
Contributor Author

Jooho commented Sep 20, 2024

/rerun-all

@israel-hdez
Contributor

@Jooho BTW, I'm confused about why replicas was removed from the workerSpec. Can you elaborate on it? I no longer see a way to right-size the number of workers.

@Jooho
Contributor Author

Jooho commented Sep 25, 2024

@israel-hdez sure.

The replicas should be automatically set from pipeline-parallel-size, because the pipeline-parallel-size value stands for the number of nodes. So even if we keep replicas, it would be ignored whenever its value differs from pipeline-parallel-size.
So I decided to remove replicas. I hope this makes sense to you, @israel-hdez.

@Jooho
Contributor Author

Jooho commented Sep 25, 2024

/rerun-all

@israel-hdez
Contributor

The replicas should be automatically set from pipeline-parallel-size, because the pipeline-parallel-size value stands for the number of nodes

Hmm... So, this makes sense but only for vLLM. But it is weird IMO, because you'd need to inspect container[kserve-container].args to find out what's the value assigned to it. If you are using some other runtime, the argument can be different. Furthermore, you may be able to use ray CLI to figure out the number of nodes and automatically adjust the pipeline-parallel-size argument for vLLM.

@Jooho
Contributor Author

Jooho commented Sep 27, 2024

I discussed with @israel-hdez and re-added the Size field under workerSpec, making some modifications to its functionality. The plan is that (size value + 1 head) will be used as the pipeline-parallel-size. If both size and pipeline-parallel-size are specified at the same time, pipeline-parallel-size takes precedence.
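
To make the precedence concrete, a hedged sketch (the size field follows this PR; whether pipeline-parallel-size arrives as a container argument, as in the earlier InferenceService example, or as an environment variable is a runtime detail):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --pipeline-parallel-size=4   # explicitly set, so it takes precedence: 1 head + 3 workers
      storageUri: "pvc://llama-3-8b-pvc"
    workerSpec:
      size: 2                          # used only when the flag above is absent: then pipeline-parallel-size = 2 + 1 = 3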

Please review this @yuzisun

Signed-off-by: jooho lee <[email protected]>
@Jooho
Contributor Author

Jooho commented Sep 27, 2024

/rerun-all


// Configure the number of replicas in the worker set, each worker set represents the unit of scaling
// +optional
Size int `json:"size,omitempty"`
Member

Shall we consider using the pipeline parallelism terminology which is more familiar to ML engineers?

Contributor Author

ML engineers can still use the pipeline parallel size:

  • If pipeline-parallel-size is set in the environment variables, that value will be utilized.
  • If it isn't set and size is specified instead, pipeline-parallel-size will be derived from size.
  • If neither is specified, pipeline-parallel-size defaults to 2 (1 head / 1 worker).

Now, both options will be available for configuration.

Member
@yuzisun Sep 27, 2024

How about renaming size to pipelineParallelismSize? Are there any benefits to needing both the field and the environment variables? It is also easier to validate fields for pipeline parallelism.

Contributor Author
@Jooho Sep 27, 2024

I find that pipelineParallelismSize can be somewhat confusing in the context of WorkerSpec.

Is there a need for both the field and the environment variables?

  • For MLOps engineers, specifying size in WorkerSpec seems like a straightforward approach to increasing the number of worker nodes.
  • On the other hand, Data Scientists might find it more intuitive to use pipeline-parallel-size.

it is also easier to validate fields for pipeline parallelism

In the end, the validation is just checking whether the worker node size + 1 (head node) and pipeline-parallel-size are the same or different. As I explained above, the size is determined according to the priority of each field, and that value is applied to both the worker node size and pipeline-parallel-size, so I think the validation concern is resolved.

Contributor

I find the naming question to be something of a chicken-and-egg problem:

  • By naming it size our naming will be aligned to LeaderWorkerSet.
  • By naming it pipelineParallelismSize, it is going to be more familiar for people using vLLM.

I personally prefer the size naming because it relates more to infrastructure configuration (which is closer to KServe's concern) than to the technique used for distributing the load.

Member

Pipeline parallelism is not just a vLLM concept; it is common LLM inference terminology, and KServe is in a good position to standardize these concepts instead of making them second-class fields.
https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/

Member
@yuzisun Oct 2, 2024

I am not sure if following LWS is a strong argument as we are solving specific inference problems here.

Member

Sorry about the strong opinion, but I really want KServe to focus on the LLM inference specification and standardize these concepts for GenAI.

Member

Discussed with @Jooho; will make the proposed change in a separate PR. Good to approve this one.

Contributor
@israel-hdez Oct 3, 2024

Sorry about the strong opinion, but I really want KServe to focus on the LLM inference specification and standardize these concepts for GenAI.

Just to contribute to the discussion... Well, the way I've understood the ServingRuntime CRD is that it is a generic CRD for configuring any runtime you'd like to use. This was my motivation for preferring the more generic size naming. Under this understanding, using pipelineParallelismSize didn't make sense to me, since we would anyway need the runtime to support it and, AFAIK, KServe would still be limited to passing this value down to the runtime container (e.g. through env vars); it is the runtime that applies the config (or ignores it).

Now... I'm not against using the pipelineParallelismSize naming, but if we do, my thought is that KServe would be communicating a closer integration with the model servers, as not every server implements pipeline parallelism. Maybe it is better to introduce a specialized CRD for LLM inferencing? We just discussed pipeline parallelism size, but we could easily add tensor parallelism size to this discussion, which would also be important (e.g. 2 GPUs on each worker).

@johnugeorge
Contributor

LGTM

@Jooho
Contributor Author

Jooho commented Oct 2, 2024

@yuzisun Could you please approve this pr?

Member
@yuzisun left a comment

/approve

@yuzisun yuzisun merged commit d5ed018 into kserve:master Oct 3, 2024
58 checks passed