[serve] Add metrics for Serve controller reconcile loop health #35667

edoakes · 2023-05-23T18:07:00Z

We should have a monotonically-increasing counter as well as latency metrics. These should be reported under system metrics on the dashboard.

edoakes · 2023-05-23T18:11:03Z

Also:

- A counter for the number of times the controller has been restarted across the cluster.
- Log the time it takes to finish recovering from checkpoint.
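
For illustration, a minimal sketch of the counter and latency bookkeeping requested here. It uses only the Python standard library. `LoopHealthMetrics` and `run_control_loop` are hypothetical names, not Serve APIs, and a real implementation would presumably export these values through a metrics backend instead of keeping them in memory.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopHealthMetrics:
    """Hypothetical in-memory stand-in for a metrics backend."""

    # Monotonically increasing: only ever incremented, never reset.
    num_iterations: int = 0
    # One latency sample (in seconds) per completed loop iteration.
    latencies_s: List[float] = field(default_factory=list)

    def record(self, latency_s: float) -> None:
        self.num_iterations += 1
        self.latencies_s.append(latency_s)


def run_control_loop(
    step: Callable[[], None],
    metrics: LoopHealthMetrics,
    iterations: int,
) -> None:
    """Run `step` a fixed number of times, timing each iteration."""
    for _ in range(iterations):
        start = time.monotonic()
        step()
        metrics.record(time.monotonic() - start)


if __name__ == "__main__":
    metrics = LoopHealthMetrics()
    run_control_loop(lambda: time.sleep(0.01), metrics, iterations=3)
    print(metrics.num_iterations, max(metrics.latencies_s))
```

If the counter stops increasing, the loop has stalled; the latency samples show whether individual iterations are getting slow. A restart counter would follow the same pattern, incremented once per controller start.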

edoakes · 2023-07-12T23:08:02Z

@GeneDer let's try to prioritize this for the 2.7 release.

akshay-anyscale · 2023-08-04T22:18:45Z

#38000
#38040

shrekris-anyscale · 2023-08-07T18:29:13Z

> Log time it takes to finish recovering from checkpoint.

@edoakes Currently, the controller recovers from checkpoints by reading cached information from the GCS and setting the goal state to that info. Is the time it takes to finish recovering from the checkpoint equal to the time it takes for the control loop to finish reconciling the current state to match the goal state? Or is it only the time it takes to update the goal state?

edoakes · 2023-08-07T20:13:19Z

@shrekris-anyscale see the main controller loop: deployment state returns a bit that indicates when all replicas have finished recovering (this should be fast in most cases; if it's slow, it might cause problems). We might already have such a log line, but let's make sure it's there with a time.
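
Roughly what I have in mind, as a hedged sketch and not the actual controller code: `RecoveryTimer` and the `any_recovering` flag are hypothetical names standing in for the bit that deployment state returns on each loop iteration.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("controller_sketch")


class RecoveryTimer:
    """Logs, once, how long it took for all replicas to finish recovering."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Assumed to be created right after the checkpoint is loaded.
        self._start = clock()
        self.recovery_latency_s: Optional[float] = None

    def observe(self, any_recovering: bool) -> None:
        """Call once per control loop iteration with the recovery bit."""
        if self.recovery_latency_s is None and not any_recovering:
            self.recovery_latency_s = self._clock() - self._start
            logger.info(
                "Finished recovering from checkpoint in %.3fs.",
                self.recovery_latency_s,
            )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    timer = RecoveryTimer()
    for any_recovering in (True, True, False, False):
        timer.observe(any_recovering)
```

The latency is recorded only on the first iteration where nothing is recovering, so later iterations do not overwrite it or log again.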

shrekris-anyscale · 2023-08-07T20:40:45Z

Oh got it, I added a log statement to this PR.

edoakes · 2023-08-08T21:52:51Z

@alanwguo -- @shrekris-anyscale has added a bunch of useful system-level metrics for the controller. Could you or someone else working on observability add these to the system tab of the serve dashboard at some point soon?

edoakes · 2023-11-06T16:08:55Z

ping @alanwguo

edoakes · 2024-03-05T18:11:59Z

ping @alanwguo

alanwguo · 2024-03-05T18:24:56Z

@architkulkarni, can you take a look at this one after your current task?

edoakes added enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks), and serve (Ray Serve Related Issue) labels May 23, 2023

edoakes changed the title ~~[serve] Add metrics for Serve reconcile loop health~~ [serve] Add metrics for Serve controller reconcile loop health May 23, 2023

akshay-anyscale assigned GeneDer May 25, 2023

edoakes self-assigned this Jul 12, 2023

shrekris-anyscale mentioned this issue Aug 7, 2023

[Serve] Improve observability for controller restarts #38177

Merged

5 tasks

akshay-anyscale assigned shrekris-anyscale and unassigned GeneDer Aug 24, 2023

alanwguo assigned architkulkarni and GeneDer Mar 5, 2024

architkulkarni mentioned this issue Mar 7, 2024

[Serve] [Dashboard] Add serve controller metrics to serve system dashboard page #43797

Merged

8 tasks

architkulkarni closed this as completed in #43797 Mar 11, 2024
