[serve] Add metrics for Serve controller reconcile loop health #35667

Closed
edoakes opened this issue May 23, 2023 · 10 comments · Fixed by #43797
Labels: enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks), serve (Ray Serve Related Issue)

Comments

@edoakes
Contributor

edoakes commented May 23, 2023

We should have a monotonically-increasing counter as well as latency metrics. These should be reported under system metrics on the dashboard.
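A minimal sketch of the two metrics asked for here, using only the standard library. `LoopMetrics` and `run_control_loop` are hypothetical names for illustration, not the Serve controller's actual code; in Ray itself, the natural building blocks would be `ray.util.metrics.Counter` and `ray.util.metrics.Histogram`.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LoopMetrics:
    """Hypothetical in-process stand-in for a metrics backend."""

    num_loops: int = 0  # monotonically increasing iteration counter
    latencies_s: list = field(default_factory=list)  # one sample per iteration

    def record(self, duration_s: float) -> None:
        self.num_loops += 1
        self.latencies_s.append(duration_s)


def run_control_loop(reconcile, metrics, iterations, clock=time.monotonic):
    """Run `reconcile` a fixed number of times, timing each iteration."""
    for _ in range(iterations):
        start = clock()
        try:
            reconcile()
        finally:
            # Record the sample even if reconcile() raises; the exception
            # still propagates to the caller afterwards.
            metrics.record(clock() - start)
```

A counter that stops increasing means the loop is stuck, while the latency samples show whether each iteration is getting slower.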

edoakes added the enhancement, P1, and serve labels on May 23, 2023
edoakes changed the title from "[serve] Add metrics for Serve reconcile loop health" to "[serve] Add metrics for Serve controller reconcile loop health" on May 23, 2023
@edoakes
Contributor Author

edoakes commented May 23, 2023

Also:

  • Counter for number of times it's been restarted across the cluster.
  • Log time it takes to finish recovering from checkpoint.

@edoakes
Contributor Author

edoakes commented Jul 12, 2023

@GeneDer let's try to prioritize this for 2.7 release

edoakes self-assigned this on Jul 12, 2023
@akshay-anyscale
Contributor

#38000
#38040

@shrekris-anyscale
Contributor

> Log time it takes to finish recovering from checkpoint.

@edoakes Currently, the controller recovers from checkpoints by reading cached information from the GCS and setting the goal state to that info. Is the time it takes to finish recovering from the checkpoint equal to the time it takes for the control loop to finish reconciling the current state to match the goal state? Or is it only the time it takes to update the goal state?

@edoakes
Contributor Author

edoakes commented Aug 7, 2023

@shrekris-anyscale see the main controller loop: deployment state returns a bit when all replicas have finished recovering (this should be fast in most cases; if it's slow, it might cause problems). We might already have such a log line, but let's make sure it's there with a time.
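Roughly, the timed log line could be sketched as below. `RecoveryTimer` and the `any_recovering` flag are hypothetical stand-ins for whatever the controller loop gets back from deployment state; this is not the actual Serve code.

```python
import logging
import time

logger = logging.getLogger("controller_sketch")


class RecoveryTimer:
    """Hypothetical helper: logs once when checkpoint recovery completes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()  # taken when the controller starts recovering
        self.duration_s = None  # set exactly once, when recovery finishes

    def update(self, any_recovering: bool) -> None:
        """Call once per control loop iteration with the recovery flag."""
        if self.duration_s is None and not any_recovering:
            self.duration_s = self._clock() - self._start
            logger.info(
                "Finished recovering from checkpoint in %.3fs.", self.duration_s
            )
```

Latching on `duration_s` keeps the message from being repeated on every later iteration.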

@shrekris-anyscale
Contributor

shrekris-anyscale commented Aug 7, 2023

Oh got it, I added a log statement to this PR.

@edoakes
Contributor Author

edoakes commented Aug 8, 2023

@alanwguo -- @shrekris-anyscale has added a bunch of useful system-level metrics for the controller. Could you or someone else working on observability add these to the system tab of the serve dashboard at some point soon?

@edoakes
Contributor Author

edoakes commented Nov 6, 2023

ping @alanwguo

@edoakes
Contributor Author

edoakes commented Mar 5, 2024

ping @alanwguo

@alanwguo
Contributor

alanwguo commented Mar 5, 2024

@architkulkarni , can you take a look at this one after your current task?

6 participants