Commit
[autoscaler v2] Interface between autoscaler and gcs (ray-project#34680)
Why are these changes needed?

This PR introduces the interface between the GCS and the autoscaler. Specifically, it introduces two APIs:

GetClusterResourceState: The autoscaler queries this interface to get cluster resource usage, which includes nodes (state and resource utilization) as well as pending requests, which include ResourceRequest, GangResourceRequest, and ClusterResourceConstraint.

NodeState includes NodeStatus, which can transition along one of three paths: ALIVE -> DEAD; ALIVE -> DRAIN_PENDING -> DRAINING -> DRAINED -> DEAD; or ALIVE -> DRAIN_PENDING -> DRAIN_FAILED. It also includes the instance_id that the autoscaler is aware of, when available, which allows the autoscaler to do reconciliation.

ResourceRequest comes with a PlacementConstraint, which supports only AntiAffinityConstraint today. Its semantics are that the resource request cannot be allocated on a node with the same label/value specified in the AntiAffinityConstraint.

GangResourceRequest has gang scheduling semantics: the requests in the gang should all be fulfilled atomically.

ReportAutoscalingState: The autoscaler also reports its own state back to the cluster using this API, including all instances (including those pending launch) as well as infeasible requests. Instance state can transition QUEUED -> REQUESTED -> BOOTSTRAPPING -> ALIVE -> TERMINATING -> DEAD. Two special states are TO_BE_PREEMPTED and TO_BE_DRAINED: the first is a forced preemption, the second a collaborative drain that can be reversed. The API also reports back requests that are infeasible, associated with a specific request version.
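The state transitions described above can be written down as lookup tables, which makes it easy to check a proposed sequence of states. The following is a minimal sketch and not the code from this PR: the enum classes, the table names, and the helper `is_valid_path` are hypothetical and exist only for illustration. Only the state names and the arrows between them come from the description.

```python
from enum import Enum, auto


class NodeStatus(Enum):
    # State names taken from the description; the Python enum is illustrative.
    ALIVE = auto()
    DRAIN_PENDING = auto()
    DRAINING = auto()
    DRAINED = auto()
    DRAIN_FAILED = auto()
    DEAD = auto()


class InstanceStatus(Enum):
    # Only the linear chain from the description is modelled here.
    QUEUED = auto()
    REQUESTED = auto()
    BOOTSTRAPPING = auto()
    ALIVE = auto()
    TERMINATING = auto()
    DEAD = auto()


# Edges copied from the three NodeStatus paths in the description.
NODE_TRANSITIONS = {
    NodeStatus.ALIVE: {NodeStatus.DEAD, NodeStatus.DRAIN_PENDING},
    NodeStatus.DRAIN_PENDING: {NodeStatus.DRAINING, NodeStatus.DRAIN_FAILED},
    NodeStatus.DRAINING: {NodeStatus.DRAINED},
    NodeStatus.DRAINED: {NodeStatus.DEAD},
}

# Edges copied from the instance chain QUEUED -> ... -> DEAD.
INSTANCE_TRANSITIONS = {
    InstanceStatus.QUEUED: {InstanceStatus.REQUESTED},
    InstanceStatus.REQUESTED: {InstanceStatus.BOOTSTRAPPING},
    InstanceStatus.BOOTSTRAPPING: {InstanceStatus.ALIVE},
    InstanceStatus.ALIVE: {InstanceStatus.TERMINATING},
    InstanceStatus.TERMINATING: {InstanceStatus.DEAD},
}


def is_valid_path(path, transitions):
    """Return True if every consecutive pair in `path` is a listed edge."""
    return all(
        nxt in transitions.get(cur, set())
        for cur, nxt in zip(path, path[1:])
    )


if __name__ == "__main__":
    drain_ok = [NodeStatus.ALIVE, NodeStatus.DRAIN_PENDING,
                NodeStatus.DRAINING, NodeStatus.DRAINED, NodeStatus.DEAD]
    print(is_valid_path(drain_ok, NODE_TRANSITIONS))
    print(is_valid_path([NodeStatus.DRAINED, NodeStatus.ALIVE], NODE_TRANSITIONS))
```

TO_BE_PREEMPTED and TO_BE_DRAINED are left out of the instance table on purpose: the description names them as special states but does not say which states they can be entered from or left to.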