
[autoscaler] Autoscaler metrics #16066

Merged: 32 commits merged into ray-project:master on Jun 1, 2021

Conversation

@ckw017 (Member) commented May 25, 2021

Why are these changes needed?

Exposes the following metrics to provide better observability into the autoscaler:

  • started_nodes (counter)
  • stopped_nodes (counter)
  • pending_nodes (gauge)
  • running_nodes (gauge)
  • worker_startup_time (histogram)
  • update_loop_exceptions (counter)
  • reset_exceptions (counter)
  • node_launch_exceptions (counter)
  • config_validation_exceptions (counter)

Metrics are exposed by default on port 44217 of whichever machine hosts the monitor; this can be overridden with the environment variable AUTOSCALER_METRIC_PORT.
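
As a rough, hypothetical sketch of how these metrics could be defined and grouped (the class name mirrors the AutoscalerPrometheusMetrics discussed in the review below, but the constructor, namespace, and help strings here are assumptions, not the PR's code):

import os
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

AUTOSCALER_METRIC_PORT = int(os.environ.get("AUTOSCALER_METRIC_PORT", 44217))

class AutoscalerPrometheusMetrics:
    def __init__(self, registry=None):
        # A dedicated registry keeps autoscaler metrics separate from other
        # collectors running in the same process.
        self.registry = registry or CollectorRegistry()
        self.started_nodes = Counter(
            "started_nodes", "Number of nodes started by the autoscaler.",
            namespace="autoscaler", registry=self.registry)
        self.stopped_nodes = Counter(
            "stopped_nodes", "Number of nodes stopped by the autoscaler.",
            namespace="autoscaler", registry=self.registry)
        self.pending_nodes = Gauge(
            "pending_nodes", "Number of nodes pending startup.",
            namespace="autoscaler", registry=self.registry)
        self.running_nodes = Gauge(
            "running_nodes", "Number of worker nodes currently running.",
            namespace="autoscaler", registry=self.registry)
        self.worker_startup_time = Histogram(
            "worker_startup_time", "Worker startup time in seconds.",
            namespace="autoscaler", registry=self.registry)
        self.update_loop_exceptions = Counter(
            "update_loop_exceptions", "Exceptions raised in the update loop.",
            namespace="autoscaler", registry=self.registry)
        # reset_exceptions, node_launch_exceptions, and
        # config_validation_exceptions follow the same Counter pattern.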

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ijrsvt (Contributor) left a comment

Have a PrometheusMetrics class.

Have that as a member variable in the autoscaler.
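
A minimal sketch of that structure, assuming the AutoscalerPrometheusMetrics class sketched above (constructor arguments other than prom_metrics are omitted or illustrative):

class StandardAutoscaler:
    def __init__(self, config_path, prom_metrics=None):
        self.config_path = config_path
        # Injected for testability; falls back to a fresh metrics object.
        self.prom_metrics = prom_metrics or AutoscalerPrometheusMetrics()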

@ijrsvt (Contributor) left a comment

Just need some tests:

  • One for metrics.json (that it is populated)
  • One for test_autoscaler (that the values are correct).

Inline review threads (now resolved):
  • python/ray/autoscaler/_private/prom_metrics.py (2 threads)
  • python/ray/_private/metrics_agent.py
  • python/ray/autoscaler/_private/autoscaler.py (2 threads)
@ijrsvt (Contributor) left a comment

One other comment:
Let's have the HTTP server start in monitor.py and then pass the AutoscalerPrometheusMetrics to StandardAutoscaler.
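
A rough sketch of that arrangement, assuming prometheus_client and the AutoscalerPrometheusMetrics sketch above; config_path and the autoscaler's other constructor arguments stand in for the monitor's existing code:

import logging
import prometheus_client

logger = logging.getLogger(__name__)

prom_metrics = AutoscalerPrometheusMetrics()
try:
    logger.info("Starting autoscaler metrics server on port %d",
                AUTOSCALER_METRIC_PORT)
    # Serve the metrics registry over HTTP from the monitor process.
    prometheus_client.start_http_server(
        AUTOSCALER_METRIC_PORT, registry=prom_metrics.registry)
except Exception:
    logger.exception("Could not start the autoscaler metrics server.")

# The autoscaler receives the metrics object instead of creating its own.
autoscaler = StandardAutoscaler(..., prom_metrics=prom_metrics)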

@ijrsvt (Contributor) left a comment

Looks good. Can you add one test to test_autoscaler.py to check that it updates a few metrics correctly? You can mock out the metrics object by doing something like:

from unittest.mock import Mock
mock_metrics = Mock()

autoscaler = StandardAutoscaler(..., prom_metrics=mock_metrics)
# some calls

assert len(mock_metrics.<some_metric>.inc.mock_calls) == <some_number>
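
For reference, a hypothetical concrete instance of the same pattern, using stopped_nodes, which the autoscaler increments once per terminated node in the diff further below:

from unittest.mock import Mock

mock_metrics = Mock()
# Simulate the autoscaler terminating two nodes.
mock_metrics.stopped_nodes.inc()
mock_metrics.stopped_nodes.inc()
# Mock records every call, so the test can assert on the call count.
assert len(mock_metrics.stopped_nodes.inc.mock_calls) == 2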

@ijrsvt (Contributor) left a comment

Looks good!

@AmeerHajAli (Contributor) commented

@edoakes @kathryn-zhou, can you please review this as well? Are there other metrics you think are worth adding?

@ckw017 (Member, Author) commented May 28, 2021

> Are there other metrics you think are worth adding?

One extra thing I spotted that might be useful: the number of failed/successful updates, tracked by just counting the totals in these two dicts.

@AmeerHajAli (Contributor) left a comment

@DmitriGekhtman can you please review this?

self.provider.create_node(node_config, node_tags, count)
startup_time = time.time() - launch_start_time
A reviewer (Contributor) commented:

I don't think this reflects startup time. For most (all?) providers, create_node sends a non-blocking API call to provision compute. Startup time is the time from create_node to completion of the ray start commands on the node, which in theory one could slightly overestimate as the time between the autoscaler putting the node into the launch queue and the node's first NodeUpdater thread completing.

Not sure how easy that is to measure.
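
A rough sketch of that measurement; every name here is illustrative rather than the PR's:

import time

# node id -> timestamp when the node entered the launch queue
launch_enqueue_time = {}

def on_node_enqueued(node_id):
    launch_enqueue_time[node_id] = time.time()

def on_first_update_complete(node_id, prom_metrics):
    start = launch_enqueue_time.pop(node_id, None)
    if start is not None:
        # Slightly overestimates "create_node to ray start done", as noted above.
        prom_metrics.worker_startup_time.observe(time.time() - start)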

@@ -126,6 +132,16 @@ def __init__(self,
self.autoscaling_config = autoscaling_config
self.autoscaler = None

self.prom_metrics = AutoscalerPrometheusMetrics()
try:
A reviewer (Contributor) commented:

Would it be OK to skip this try-except block if monitor_ip is None (indicating we're not collecting autoscaler metrics)? In general, it would be ideal if things looked the same externally as before when monitor_ip=None.

The Kubernetes Operator currently runs multiple monitor processes at the same IP, which is why we're skipping this support for K8s right now.
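
A minimal standalone sketch of that guard (hypothetical helper name, assuming prometheus_client and a metrics object like the one sketched earlier):

import logging
import prometheus_client

logger = logging.getLogger(__name__)

def maybe_start_metrics_server(monitor_ip, prom_metrics, port):
    # Only start the Prometheus HTTP server when autoscaler metrics are being
    # collected (monitor_ip is set); the monitor_ip=None path stays unchanged.
    if monitor_ip is None:
        return
    try:
        prometheus_client.start_http_server(port, registry=prom_metrics.registry)
    except Exception:
        logger.exception("Could not start the autoscaler metrics server.")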

@ckw017 added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on May 29, 2021
@AmeerHajAli (Contributor) left a comment

Shouldn't monitor_ip = redis_address? Why do we need monitor_ip?

I think that if the goal is to make sure this works even when we move the autoscaler off the head node, then that can be a separate discussion, because I'm not sure how you can connect to the IP of a node that isn't reachable on your network.

@DmitriGekhtman (Contributor) commented

> Shouldn't monitor_ip = redis_address? Why do we need monitor_ip?
>
> I think that if the goal is to make sure this works even when we move the autoscaler off the head node, then that can be a separate discussion, because I'm not sure how you can connect to the IP of a node that isn't reachable on your network.

Right -- for the immediate moment, we don't actually need this.

In its current form, the K8s operator runs a bunch of monitors for different Ray clusters at a single IP (in a single pod, distinct from any of the Ray head pods, potentially on another machine, but within the same K8s cluster). We're planning to move each monitor into its own pod. At that point, each monitor will live at its own IP, distinct from the Ray head IP, and we will be able to support this functionality on K8s -- we will need to track the monitor IP at that point.

For right now, we should make sure the behavior of the K8s operator is unaffected by these changes.

tag_filters={TAG_RAY_NODE_KIND: NODE_KIND_WORKER})
# Update running nodes gauge whenever we check workers
self.prom_metrics.running_nodes.set(len(nodes))
A reviewer (Contributor) commented:

This is just workers; it does not include the head node, right?

@ijrsvt (Contributor) left a comment

In a follow-up, can we add

  • Counter of total number of failed nodes.
  • Worker UpdaterThread time.

@@ -237,6 +244,7 @@ def _update(self):
self.provider.terminate_nodes(nodes_to_terminate)
for node in nodes_to_terminate:
self.node_tracker.untrack(node)
self.prom_metrics.stopped_nodes.inc()
nodes = self.workers()

to_launch = self.resource_demand_scheduler.get_nodes_to_launch(
A reviewer (Contributor) commented:

We should also call self.prom_metrics.stopped_nodes.inc() here:

[screenshot of the relevant code block]

@AmeerHajAli (Contributor) commented

TODO in this PR:

  • add the nodes terminated under if failed_nodes: to stopped_nodes. (P1) See the sketch after this list.

TODO in a follow-up PR:

  • add the total number of failed updates (under if failed_nodes:) to failed_nodes. (P1)
  • add the number of nodes that failed to be created in node_launcher. (P1)
  • add the number of updating nodes at the end of the update function. (P1)
  • add the time for launching a node. (P2)
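
A hypothetical sketch of the in-scope item, mirroring the termination path shown in the diff above (the helper name and arguments are illustrative, not the PR's code):

def terminate_failed_nodes(provider, node_tracker, prom_metrics, failed_nodes):
    # Nodes terminated because they failed should count toward stopped_nodes
    # just like normally terminated nodes.
    if not failed_nodes:
        return
    provider.terminate_nodes(failed_nodes)
    for node_id in failed_nodes:
        node_tracker.untrack(node_id)
        prom_metrics.stopped_nodes.inc()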

@ijrsvt (Contributor) commented Jun 1, 2021

Tested via a ray up cluster. I manually installed this commit using python setup-dev.py. Metrics were available on port 44217 (the AUTOSCALER_METRIC_PORT default).

[screenshot of the exposed metrics]
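
A hypothetical way to reproduce that check from the machine hosting the monitor (the IP below is a placeholder):

import urllib.request

head_ip = "127.0.0.1"  # replace with the monitor's machine
metrics_text = urllib.request.urlopen(
    "http://{}:44217".format(head_ip)).read().decode()
# The page should include entries for started_nodes, running_nodes, etc.
print(metrics_text)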

@ijrsvt merged commit 31364ed into ray-project:master on Jun 1, 2021
DmitriGekhtman pushed a commit that referenced this pull request Jun 1, 2021
mwtian pushed a commit that referenced this pull request Jun 1, 2021