[serve] Add metrics for Serve controller reconcile loop health #35667
Comments
Also:
@GeneDer let's try to prioritize this for the 2.7 release.
@edoakes Currently, the controller recovers from checkpoints by reading cached information from the GCS and setting the goal state to that info. Is the time it takes to finish recovering from the checkpoint equal to the time it takes for the control loop to finish reconciling the current state to match the goal state? Or is it only the time it takes to update the goal state?
@shrekris-anyscale see the main controller loop: deployment state returns a bit (a boolean flag) once all replicas have finished recovering. This should be fast in most cases; if it's slow, it might cause problems. We might already have such a log line, but let's make sure it's there with the elapsed time.
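For illustration, the timing described above could be sketched roughly as follows. This is a minimal stand-alone sketch, not the actual Serve controller code: `RecoveryTimer`, `observe`, and the logger name are hypothetical, and the `recovered` argument stands in for the flag that deployment state returns from the control loop.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("controller_sketch")  # hypothetical logger name


class RecoveryTimer:
    """Tracks how long it takes until recovery is first reported as done."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()  # taken when the controller starts recovering
        self.elapsed_s: Optional[float] = None

    def observe(self, recovered: bool) -> Optional[float]:
        """Call once per control-loop iteration with the recovery flag.

        Logs the elapsed time the first time `recovered` is True and
        returns it; returns None while recovery is still in progress.
        """
        if recovered and self.elapsed_s is None:
            self.elapsed_s = self._clock() - self._start
            logger.info("Finished recovering after %.2fs.", self.elapsed_s)
        return self.elapsed_s
```

Passing the clock in as a parameter keeps the sketch easy to test with a fake clock; a real implementation would more likely read the time directly.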
Oh got it, I added a log statement to this PR.
@alanwguo -- @shrekris-anyscale has added a bunch of useful system-level metrics for the controller. Could you or someone else working on observability add these to the system tab of the serve dashboard at some point soon?
ping @alanwguo
@architkulkarni, can you take a look at this one after your current task?
We should have a monotonically-increasing counter as well as latency metrics. These should be reported under system metrics on the dashboard.
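As a rough sketch of what this request amounts to, the loop below records a monotonically increasing iteration counter and a per-iteration latency sample. Everything here is hypothetical and in-memory: `LoopMetrics` and `run_control_loop` are illustrative names, not Serve internals. In Ray itself, `ray.util.metrics.Counter` and `ray.util.metrics.Histogram` could plausibly play these roles, but that mapping is an assumption and not something this issue specifies.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopMetrics:
    """Hypothetical in-memory stand-in for a metrics backend."""

    loop_count: int = 0  # monotonically increasing counter
    loop_durations_s: List[float] = field(default_factory=list)  # latency samples

    def record(self, duration_s: float) -> None:
        self.loop_count += 1
        self.loop_durations_s.append(duration_s)


def run_control_loop(
    step: Callable[[], None],
    iterations: int,
    metrics: LoopMetrics,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Run `step` repeatedly, recording one counter tick and one latency per pass."""
    for _ in range(iterations):
        start = clock()
        try:
            step()
        finally:
            # Record even if the step raises, so a failing loop is still visible.
            metrics.record(clock() - start)
```

A counter that stops increasing then signals a stuck loop, while the latency samples show a loop that is still running but slow.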