[autoscaler] Autoscaler metrics #16066
Conversation
Have a PrometheusMetrics class, and have it as a member variable in the autoscaler.
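A minimal sketch of what that class could look like, using prometheus_client (the metric names come from this PR's description; constructor details here are illustrative):

from prometheus_client import Counter, Gauge, Histogram

class AutoscalerPrometheusMetrics:
    def __init__(self):
        # Counters only ever increase; gauges can go up and down.
        self.started_nodes = Counter(
            "started_nodes", "Number of nodes started by the autoscaler.")
        self.stopped_nodes = Counter(
            "stopped_nodes", "Number of nodes stopped by the autoscaler.")
        self.pending_nodes = Gauge(
            "pending_nodes", "Number of nodes currently pending.")
        self.running_nodes = Gauge(
            "running_nodes", "Number of worker nodes currently running.")
        self.worker_startup_time = Histogram(
            "worker_startup_time", "Worker startup time in seconds.")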
(force-pushed from 0909b5e to 9050bab)
Just need some tests:
- One for metrics.json (that it is populated)
- One for test_autoscaler (that the values are correct).
One other comment: let's have the HTTPServer start in monitor.py and then pass the AutoscalerPrometheusMetrics to StandardAutoscaler.
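Roughly, the suggested wiring in monitor.py could look like this (a sketch, assuming prometheus_client's start_http_server; the default port 44217 and the AUTOSCALER_METRIC_PORT override come from this PR's description):

import os
import prometheus_client

metrics_port = int(os.environ.get("AUTOSCALER_METRIC_PORT", 44217))
prometheus_client.start_http_server(metrics_port)  # serves the metrics over HTTP
prom_metrics = AutoscalerPrometheusMetrics()
# then hand the same object to the autoscaler:
# autoscaler = StandardAutoscaler(..., prom_metrics=prom_metrics)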
Looks good. Can you add one test to test_autoscaler.py that checks it updates a few metrics correctly? You can mock out the metrics object by doing something like:
from unittest.mock import Mock
mock_metrics = Mock()
autoscaler = StandardAutoscaler(..., prom_metrics=mock_metrics)
# some calls
assert len(mock_metrics.<some_metric>.inc.mock_calls) == <some_number>
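(Since Mock auto-creates attributes on access, every metric looked up on mock_metrics records its calls, so no real Prometheus registry is needed in the test.)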
Looks good!
@edoakes @kathryn-zhou, can you please review this as well? Are there other metrics you think are worth adding?
One extra thing I spotted that might be useful: the number of failed/successful updates, obtained by just tracking the total counts in these two dicts.
@DmitriGekhtman can you please review this?
self.provider.create_node(node_config, node_tags, count)
startup_time = time.time() - launch_start_time
I don't think this reflects startup time. For most (all?) providers, create_node sends a non-blocking API call to provision compute. Startup time is the time from create_node to the completion of the ray start commands on the node, which in theory one could slightly overestimate as the time between the autoscaler putting the node into the launch queue and the node's first NodeUpdater thread completing. Not sure how easy that is to measure.
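A hypothetical sketch of that over-estimate (all names here are illustrative): record when each node enters the launch queue, and observe the histogram when the node's first updater finishes.

import time

enqueue_time = {}  # node_id -> time the node entered the launch queue

def on_node_enqueued(node_id):
    enqueue_time[node_id] = time.time()

def on_first_updater_complete(node_id, prom_metrics):
    # Slight over-estimate of startup time, as described above.
    start = enqueue_time.pop(node_id, None)
    if start is not None:
        prom_metrics.worker_startup_time.observe(time.time() - start)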
@@ -126,6 +132,16 @@ def __init__(self,
        self.autoscaling_config = autoscaling_config
        self.autoscaler = None

        self.prom_metrics = AutoscalerPrometheusMetrics()
        try:
Would it be OK to skip this try-except block if monitor_ip is None (indicating we're not collecting autoscaler metrics)? In general, it would be ideal if things looked the same externally as before when monitor_ip=None.
The Kubernetes Operator currently runs multiple monitor processes at the same IP, which is why we're skipping this support for k8s right now.
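A minimal sketch of that guard, assuming the server is started in monitor.py (the helper name here is made up):

import logging
import prometheus_client

logger = logging.getLogger(__name__)

def maybe_start_metrics_server(monitor_ip, port):
    # Skip entirely when monitor_ip is None, so that external behavior
    # with monitor_ip=None is unchanged by this PR.
    if monitor_ip is None:
        return
    try:
        prometheus_client.start_http_server(port)
    except Exception:
        logger.exception("Failed to start the Prometheus metrics server.")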
Shouldn't monitor_ip = redis_address? Why do we need monitor_ip?
I think that if the goal was to make sure this works even when we move the autoscaler out of the head node, then this can be a separate discussion, because I'm not sure how you can connect to the IP of a node that isn't available in your network.
Right -- for the immediate moment, we don't actually need this. In its current form, the K8s operator runs a bunch of monitors for different Ray clusters at a single IP. For right now, we should make sure the behavior of the K8s operator is unaffected by these changes.
tag_filters={TAG_RAY_NODE_KIND: NODE_KIND_WORKER})
# Update running nodes gauge whenever we check workers
self.prom_metrics.running_nodes.set(len(nodes))
This is just the workers; it does not include the head node, right?
In a follow-up, can we add
- Counter of total number of failed nodes.
- Worker UpdaterThread time.
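Declaring both of those would look something like this (a sketch; the metric names are hypothetical):

from prometheus_client import Counter, Histogram

failed_nodes = Counter(
    "failed_nodes", "Total number of failed nodes.")
updater_thread_time = Histogram(
    "worker_updater_thread_time",
    "Time spent by worker UpdaterThreads, in seconds.")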
@@ -237,6 +244,7 @@ def _update(self):
        self.provider.terminate_nodes(nodes_to_terminate)
        for node in nodes_to_terminate:
            self.node_tracker.untrack(node)
            self.prom_metrics.stopped_nodes.inc()
        nodes = self.workers()

        to_launch = self.resource_demand_scheduler.get_nodes_to_launch(
TODO in this PR:
Co-authored-by: Ian <[email protected]>
Why are these changes needed?
Exposes the following metrics to provide better observability into the autoscaler:
- started_nodes (counter)
- stopped_nodes (counter)
- pending_nodes (gauge)
- running_nodes (gauge)
- worker_startup_time (histogram)
- update_loop_exceptions (counter)
- reset_exceptions (counter)
- node_launch_exceptions (counter)
- config_validation_exceptions (counter)

Metrics are exposed by default on port 44217 of whichever machine hosts the monitor; this can be overridden with the environment variable AUTOSCALER_METRIC_PORT.
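For example, a quick way to check from the monitor host that the endpoint is up (a sketch; prometheus_client serves the metrics text format over plain HTTP):

import os
import urllib.request

port = os.environ.get("AUTOSCALER_METRIC_PORT", "44217")
print(urllib.request.urlopen(f"http://localhost:{port}").read().decode())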
Related issue number
Checks
- I've run scripts/format.sh to lint the changes in this PR.