
[autoscaler] Autoscaler metrics #16066

Merged: 32 commits merged into ray-project:master on Jun 1, 2021

Conversation

@ckw017 (Member) commented May 25, 2021

Why are these changes needed?

Exposes the following metrics to provide better observability into the autoscaler:

  • started_nodes (counter)
  • stopped_nodes (counter)
  • pending_nodes (gauge)
  • running_nodes (gauge)
  • worker_startup_time (histogram)
  • update_loop_exceptions (counter)
  • reset_exceptions (counter)
  • node_launch_exceptions (counter)
  • config_validation_exceptions (counter)

Metrics are exposed by default on port 44217 of whichever machine hosts the monitor; this can be overridden with the environment variable AUTOSCALER_METRIC_PORT.
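
As a rough, hypothetical sketch of how these metrics could be defined and grouped (the class name mirrors the AutoscalerPrometheusMetrics discussed in the review below, but the constructor, namespace, and help strings here are assumptions, not the PR's code):

import os
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

AUTOSCALER_METRIC_PORT = int(os.environ.get("AUTOSCALER_METRIC_PORT", 44217))

class AutoscalerPrometheusMetrics:
    def __init__(self, registry=None):
        # A dedicated registry keeps autoscaler metrics separate from other
        # collectors running in the same process.
        self.registry = registry or CollectorRegistry()
        self.started_nodes = Counter(
            "started_nodes", "Number of nodes started by the autoscaler.",
            namespace="autoscaler", registry=self.registry)
        self.stopped_nodes = Counter(
            "stopped_nodes", "Number of nodes stopped by the autoscaler.",
            namespace="autoscaler", registry=self.registry)
        self.pending_nodes = Gauge(
            "pending_nodes", "Number of nodes pending startup.",
            namespace="autoscaler", registry=self.registry)
        self.running_nodes = Gauge(
            "running_nodes", "Number of worker nodes currently running.",
            namespace="autoscaler", registry=self.registry)
        self.worker_startup_time = Histogram(
            "worker_startup_time", "Worker startup time in seconds.",
            namespace="autoscaler", registry=self.registry)
        self.update_loop_exceptions = Counter(
            "update_loop_exceptions", "Exceptions raised in the update loop.",
            namespace="autoscaler", registry=self.registry)
        # reset_exceptions, node_launch_exceptions, and
        # config_validation_exceptions follow the same Counter pattern.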

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ijrsvt (Contributor) left a comment

Have a PrometheusMetrics class.

Have that as a member variable in the autoscaler.
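
A minimal sketch of that structure, assuming the AutoscalerPrometheusMetrics class sketched above (constructor arguments other than prom_metrics are omitted or illustrative):

class StandardAutoscaler:
    def __init__(self, config_path, prom_metrics=None):
        self.config_path = config_path
        # Injected for testability; falls back to a fresh metrics object.
        self.prom_metrics = prom_metrics or AutoscalerPrometheusMetrics()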

@ijrsvt (Contributor) left a comment

Just need some tests:

  • One for metrics.json (that it is populated)
  • One for test_autoscaler (that the values are correct).

Inline review threads (now resolved):
  • python/ray/autoscaler/_private/prom_metrics.py (2 threads)
  • python/ray/_private/metrics_agent.py
  • python/ray/autoscaler/_private/autoscaler.py (2 threads)
@ijrsvt (Contributor) left a comment

One other comment:
Let's have the HTTP server start in monitor.py and then pass the AutoscalerPrometheusMetrics to StandardAutoscaler.
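
A rough sketch of that arrangement, assuming prometheus_client and the AutoscalerPrometheusMetrics sketch above; config_path and the autoscaler's other constructor arguments stand in for the monitor's existing code:

import logging
import prometheus_client

logger = logging.getLogger(__name__)

prom_metrics = AutoscalerPrometheusMetrics()
try:
    logger.info("Starting autoscaler metrics server on port %d",
                AUTOSCALER_METRIC_PORT)
    # Serve the metrics registry over HTTP from the monitor process.
    prometheus_client.start_http_server(
        AUTOSCALER_METRIC_PORT, registry=prom_metrics.registry)
except Exception:
    logger.exception("Could not start the autoscaler metrics server.")

# The autoscaler receives the metrics object instead of creating its own.
autoscaler = StandardAutoscaler(..., prom_metrics=prom_metrics)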

@ijrsvt (Contributor) left a comment

Looks good. Can you add one test to test_autoscaler.py to check that it updates a few metrics correctly? You can mock out the metrics object by doing something like:

from unittest.mock import Mock
mock_metrics = Mock()

autoscaler = StandardAutoscaler(..., prom_metrics=mock_metrics)
# some calls

assert len(mock_metrics.<some_metric>.inc.mock_calls) == <some_number>
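
For reference, a hypothetical concrete instance of the same pattern, using stopped_nodes, which the autoscaler increments once per terminated node in the diff further below:

from unittest.mock import Mock

mock_metrics = Mock()
# Simulate the autoscaler terminating two nodes.
mock_metrics.stopped_nodes.inc()
mock_metrics.stopped_nodes.inc()
# Mock records every call, so the test can assert on the call count.
assert len(mock_metrics.stopped_nodes.inc.mock_calls) == 2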

@ijrsvt (Contributor) left a comment

Looks good!

@AmeerHajAli (Contributor) commented

@edoakes @kathryn-zhou, can you please review this as well? Are there other metrics you think are worth adding?

@ckw017 (Member, Author) commented May 28, 2021

> Are there other metrics you think are worth adding?

One extra thing I spotted that might be useful: the number of failed/successful updates, tracked by just counting the totals in these two dicts.

@AmeerHajAli (Contributor) left a comment

@DmitriGekhtman can you please review this?

self.provider.create_node(node_config, node_tags, count)
startup_time = time.time() - launch_start_time
A reviewer (Contributor) commented:

I don't think this reflects startup time. For most (all?) providers, create_node sends a non-blocking API call to provision compute. Startup time is the time from create_node to completion of the ray start commands on the node, which in theory one could slightly overestimate as the time between the autoscaler putting the node into the launch queue and the node's first NodeUpdater thread completing.

Not sure how easy that is to measure.
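
A rough sketch of that measurement; every name here is illustrative rather than the PR's:

import time

# node id -> timestamp when the node entered the launch queue
launch_enqueue_time = {}

def on_node_enqueued(node_id):
    launch_enqueue_time[node_id] = time.time()

def on_first_update_complete(node_id, prom_metrics):
    start = launch_enqueue_time.pop(node_id, None)
    if start is not None:
        # Slightly overestimates "create_node to ray start done", as noted above.
        prom_metrics.worker_startup_time.observe(time.time() - start)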

@@ -126,6 +132,16 @@ def __init__(self,
self.autoscaling_config = autoscaling_config
self.autoscaler = None

self.prom_metrics = AutoscalerPrometheusMetrics()
try:
A reviewer (Contributor) commented:

Would it be OK to skip this try-except block if monitor_ip is None (indicating we're not collecting autoscaler metrics)? In general, it would be ideal if things looked the same externally as before when monitor_ip=None.

The Kubernetes Operator currently runs multiple monitor processes at the same IP, which is why we're skipping this support for K8s right now.
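
A minimal standalone sketch of that guard (hypothetical helper name, assuming prometheus_client and a metrics object like the one sketched earlier):

import logging
import prometheus_client

logger = logging.getLogger(__name__)

def maybe_start_metrics_server(monitor_ip, prom_metrics, port):
    # Only start the Prometheus HTTP server when autoscaler metrics are being
    # collected (monitor_ip is set); the monitor_ip=None path stays unchanged.
    if monitor_ip is None:
        return
    try:
        prometheus_client.start_http_server(port, registry=prom_metrics.registry)
    except Exception:
        logger.exception("Could not start the autoscaler metrics server.")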

@ckw017 added the @author-action-required label (the PR author is responsible for the next step; remove the tag to send back to the reviewer) on May 29, 2021
@AmeerHajAli (Contributor) left a comment

Shouldn't monitor_ip = redis_address? Why do we need monitor_ip?

I think that if the goal is to make sure this works even when we move the autoscaler off the head node, then that can be a separate discussion, because I'm not sure how you can connect to the IP of a node that isn't reachable on your network.

@DmitriGekhtman (Contributor) commented

> Shouldn't monitor_ip = redis_address? Why do we need monitor_ip?
>
> I think that if the goal is to make sure this works even when we move the autoscaler off the head node, then that can be a separate discussion, because I'm not sure how you can connect to the IP of a node that isn't reachable on your network.

Right -- for the immediate moment, we don't actually need this.

In its current form, the K8s operator runs a bunch of monitors for different Ray clusters at a single IP (in a single pod, distinct from any of the Ray head pods, potentially on another machine, but within the same K8s cluster). We're planning to move each monitor into its own pod. At that point, each monitor will live at its own IP, distinct from the Ray head IP, and we will be able to support this functionality on K8s -- we will need to track the monitor IP at that point.

For right now, we should make sure the behavior of the K8s operator is unaffected by these changes.

tag_filters={TAG_RAY_NODE_KIND: NODE_KIND_WORKER})
# Update running nodes gauge whenever we check workers
self.prom_metrics.running_nodes.set(len(nodes))
A reviewer (Contributor) commented:

This is just workers; it does not include the head node, right?

@ijrsvt (Contributor) left a comment

In a follow-up, can we add

  • Counter of total number of failed nodes.
  • Worker UpdaterThread time.

@@ -237,6 +244,7 @@ def _update(self):
self.provider.terminate_nodes(nodes_to_terminate)
for node in nodes_to_terminate:
self.node_tracker.untrack(node)
self.prom_metrics.stopped_nodes.inc()
nodes = self.workers()

to_launch = self.resource_demand_scheduler.get_nodes_to_launch(
A reviewer (Contributor) commented:

We should also call self.prom_metrics.stopped_nodes.inc() here:

[screenshot of the relevant code block]

@AmeerHajAli (Contributor) commented

TODO in this PR:

  • add the nodes terminated under if failed_nodes: to stopped_nodes. (P1) See the sketch after this list.

TODO in a follow-up PR:

  • add the total number of failed updates (under if failed_nodes:) to failed_nodes. (P1)
  • add the number of nodes that failed to be created in node_launcher. (P1)
  • add the number of updating nodes at the end of the update function. (P1)
  • add the time for launching a node. (P2)
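
A hypothetical sketch of the in-scope item, mirroring the termination path shown in the diff above (the helper name and arguments are illustrative, not the PR's code):

def terminate_failed_nodes(provider, node_tracker, prom_metrics, failed_nodes):
    # Nodes terminated because they failed should count toward stopped_nodes
    # just like normally terminated nodes.
    if not failed_nodes:
        return
    provider.terminate_nodes(failed_nodes)
    for node_id in failed_nodes:
        node_tracker.untrack(node_id)
        prom_metrics.stopped_nodes.inc()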

@ijrsvt (Contributor) commented Jun 1, 2021

Tested via a ray up cluster. I manually installed this commit using python setup-dev.py. Metrics were available on port 44217 (the AUTOSCALER_METRIC_PORT default).

[screenshot of the exposed metrics]
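
A hypothetical way to reproduce that check from the machine hosting the monitor (the IP below is a placeholder):

import urllib.request

head_ip = "127.0.0.1"  # replace with the monitor's machine
metrics_text = urllib.request.urlopen(
    "http://{}:44217".format(head_ip)).read().decode()
# The page should include entries for started_nodes, running_nodes, etc.
print(metrics_text)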

@ijrsvt merged commit 31364ed into ray-project:master on Jun 1, 2021
DmitriGekhtman pushed a commit that referenced this pull request Jun 1, 2021
mwtian pushed a commit that referenced this pull request Jun 1, 2021