[autoscaler] Autoscaler metrics #16066

Merged 32 commits on Jun 1, 2021
46ea81b
initial
ckw017 May 24, 2021
7c7cd1a
sanity check
ckw017 May 24, 2021
f170974
lint and more
ckw017 May 25, 2021
4150bf0
remove extra file?
ckw017 May 25, 2021
771b304
format
ckw017 May 26, 2021
3f11bc6
store ip of machine running monitor process
ckw017 May 27, 2021
eb9edb1
autoscaler_ip -> monitor_ip
ckw017 May 27, 2021
b4130a8
lint
ckw017 May 27, 2021
09a8998
add worker startup time buckets
ckw017 May 27, 2021
849c1ba
better descriptions
ckw017 May 27, 2021
6f58b81
more lint
ckw017 May 27, 2021
9050bab
propogate exception when starting prom http
ckw017 May 27, 2021
7be69c1
lint
ckw017 May 27, 2021
8f682b4
fix redis set/get
ckw017 May 27, 2021
51f7a5c
move start_http to monitor.py
ckw017 May 27, 2021
4259ecd
break up exception types and add pending_nodes metric
ckw017 May 27, 2021
bb9896e
Adjust buckets, fix test_autoscaler failures
ckw017 May 27, 2021
091610a
Add metric_agent tests
ckw017 May 27, 2021
ee005b1
explain _AUTOSCALER_METRICS
ckw017 May 27, 2021
1d82167
add basic exception count checks
ckw017 May 27, 2021
a43b38e
more autoscaler metric tests
ckw017 May 28, 2021
25f55dc
less dangerous way to handle no prom_metrics
ckw017 May 28, 2021
f7013c2
more mock checks
ckw017 May 28, 2021
b2bd1e5
better docs
ckw017 May 28, 2021
a0c10f0
nits
ckw017 May 28, 2021
c069b8c
cases for started_nodes and worker_startup_time histogram
ckw017 May 28, 2021
2c18da5
add node_launch_exceptions case
ckw017 May 28, 2021
216060c
use waitFor
ckw017 May 28, 2021
377d2c8
don't start http server if monitor_ip isn't provided
ckw017 May 31, 2021
0f1a6b2
drop worker_startup_time
ckw017 May 31, 2021
a6cc229
lint
ckw017 May 31, 2021
2edcb1a
Hotfix [nodes -> workers] + [count failed nodes as stopped]
ijrsvt May 31, 2021
use waitFor
ckw017 committed May 28, 2021
commit 216060c2ff3527e621fba84ac9cb8e71eb0cd86a
15 changes: 8 additions & 7 deletions python/ray/tests/test_autoscaler.py
@@ -2391,7 +2391,7 @@ def terminate_worker_zero():
         assert set(autoscaler.workers()) == {2, 3},\
             "Unexpected node_ids"
 
-    def testFlakyProvider(self):
+    def testProviderException(self):
         config_path = self.write_config(SMALL_CLUSTER)
         self.provider = MockProvider()
         self.provider.error_creates = True
@@ -2405,12 +2405,13 @@ def testFlakyProvider(self):
             update_interval_s=0,
             prom_metrics=mock_metrics)
         autoscaler.update()
-        for _ in range(50):
-            if mock_metrics.node_launch_exceptions.inc.call_count == 1:
-                break
-            time.sleep(.1)
-        assert mock_metrics.node_launch_exceptions.inc.call_count == 1,\
-            "Expected to observe a node launch exception"
+
+        def exceptions_incremented():
+            return mock_metrics.node_launch_exceptions.inc.call_count == 1
+
+        self.waitFor(
+            exceptions_incremented,
+            fail_msg="Expected to see a node launch exception")
 
 
 if __name__ == "__main__":
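
Note on the change: the commit replaces a hand-rolled poll-then-assert loop with the test class's waitFor helper. The sketch below shows roughly what such a helper does. It is not the implementation in test_autoscaler.py: wait_for, its num_retries parameter, and the threading.Timer that stands in for the asynchronous node launcher are assumptions made for illustration. The retry count and sleep interval copy the values from the removed loop.

```python
import threading
import time
from unittest.mock import Mock


def wait_for(condition, num_retries=50, fail_msg=None):
    """Poll `condition` until it returns True or the retries run out.

    Hypothetical stand-in for the test class's waitFor; 50 retries and a
    0.1 s sleep mirror the loop that this commit removes.
    """
    for _ in range(num_retries):
        if condition():
            return
        time.sleep(0.1)
    raise AssertionError(fail_msg or "Timed out waiting for condition")


# Mock standing in for the metrics object passed as prom_metrics.
mock_metrics = Mock()

# Assumption: simulate a background launcher that records the failure
# a little after update() returns.
timer = threading.Timer(0.2, mock_metrics.node_launch_exceptions.inc)
timer.start()


def exceptions_incremented():
    return mock_metrics.node_launch_exceptions.inc.call_count == 1


wait_for(
    exceptions_incremented,
    fail_msg="Expected to see a node launch exception")
timer.join()
```

Compared with the old loop, the condition is written once as a named predicate, and the failure message is raised by the helper instead of a separate trailing assert that repeats the same expression.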