Don't start dashboard agent when missing dependencies #17966

edoakes · 2021-08-19T21:17:02Z

Why are these changes needed?

Avoids spawning the agent altogether if the relevant dependencies are missing. We use the presence of some required optional pip packages for the check here, so it's important to keep them in sync between the dashboard code and the check to start it. This is achieved by moving optional dependencies to their own submodule within the dashboard, which is imported both by the code that decides whether or not to start the dashboard and the code in the dashboard that actually uses the packages.
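As a rough illustration of the "check before spawning" idea, here is a minimal sketch. It is not the code from this PR: the PR describes importing a shared submodule, whereas this sketch probes for packages with `importlib.util.find_spec` so that it stays self-contained. The names `AGENT_OPTIONAL_DEPS`, `missing_optional_deps`, and `should_start_agent` are hypothetical, and the package list is a placeholder rather than the real dependency set.

```python
import importlib.util
from typing import Iterable, List

# Placeholder list for illustration only; the thread does not enumerate
# the packages the dashboard agent actually requires.
AGENT_OPTIONAL_DEPS = ("aiohttp", "prometheus_client")


def missing_optional_deps(names: Iterable[str]) -> List[str]:
    """Return the subset of `names` that cannot be found for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for dotted names with a missing parent
            # package, or for modules whose __spec__ is unset.
            found = False
        if not found:
            missing.append(name)
    return missing


def should_start_agent(names: Iterable[str] = AGENT_OPTIONAL_DEPS) -> bool:
    """Hypothetical gate: only spawn the agent if every dependency is present."""
    return not missing_optional_deps(names)
```

Keeping the list in one place, used by both the launcher and the agent, is what prevents the two from drifting apart, which is the concern the description raises.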

Related issue number

Closes #17965

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

edoakes · 2021-08-19T21:20:08Z

@richardliaw do you have any pointers to tests we have for the minimal vs default distribution? Not sure if there's an existing way to test the behavior with a specific set of deps.

edoakes · 2021-08-19T21:27:50Z

Ok, it looks like our exit code detection doesn't actually work :( this always returns 0 because it's using some file descriptor magic to determine if the process died...

rkooo567

~~Oh nice! The fix is simple :)~~

Nvm.. just saw your message...

rkooo567 · 2021-08-19T21:31:09Z

Maybe we can have some retry count with exponential backoff? Like

10ms -> 20ms -> 40ms -> 80ms -> 2s -> 10s -> stop restarting

edoakes · 2021-08-19T21:38:14Z

Yeah @rkooo567 I was thinking something similar... we can try restarting it some number of times and if it never succeeds then we stop restarting altogether. Another option would be to just continue restarting at some long interval forever.

edoakes · 2021-08-19T21:40:52Z

We may also be able to just avoid spawning the agent altogether if we don't have the right deps... let me see how hard that would be

python/ray/_private/services.py

rkooo567 · 2021-08-20T05:11:13Z

@edoakes I actually feel like this approach can be pretty nice and provide clean error messages.

> Maybe we can have some retry count with exponential backoff? Like
>
> 10ms -> 20ms -> 40ms -> 80ms -> 2s -> 10s -> stop restarting

I think this is something we can do:

1. When the raylet starts the dashboard agent, it sets restart_cnt = 0 and backoff_interval = 100ms.
2. The agent starts. If it fails while restart_cnt != max_restart_cnt, don't publish any error log.
3. The raylet restarts the agent and sets backoff_interval *= 2 and restart_cnt += 1.
4. Repeat until restart_cnt == max_restart_cnt.
5. If the agent stays alive for more than X seconds, reset restart_cnt and backoff_interval.
6. If the agent dies and restart_cnt == max_restart_cnt, do the following:
   - If the agent fails with an exception, publish a short message: `Dashboard agent at IP="" died. You might lose metrics and information from the dashboard. Check logs X to see the root cause of failure.` (This is the real-issue case, but we can provide clean logs with less verbose error messages.)
   - If the agent fails with the dependency error, don't print any error. (This is the no-dependency case.)
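The restart steps above could be sketched roughly as follows. This is only an illustration of the proposal, not anything implemented in this PR or in the raylet: `RestartPolicy`, `Decision`, and `on_agent_exit` are hypothetical names, and the defaults for `max_restart_cnt` and `stable_after_s` (the "X seconds" threshold) are assumed values that the thread does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    restart_after_s: Optional[float]  # None means "stop restarting"
    message: Optional[str]            # None means "publish nothing"


class RestartPolicy:
    """Hypothetical bookkeeping for the proposed agent restart logic."""

    def __init__(self, max_restart_cnt: int = 5,
                 initial_backoff_s: float = 0.1,   # the 100ms from the proposal
                 stable_after_s: float = 30.0):    # assumed "X seconds"
        self.max_restart_cnt = max_restart_cnt
        self.initial_backoff_s = initial_backoff_s
        self.stable_after_s = stable_after_s
        self.restart_cnt = 0
        self.backoff_s = initial_backoff_s

    def on_agent_exit(self, uptime_s: float, missing_deps: bool,
                      ip: str = "", log_path: str = "X") -> Decision:
        # A long-lived agent counts as healthy, so start over from scratch.
        if uptime_s > self.stable_after_s:
            self.restart_cnt = 0
            self.backoff_s = self.initial_backoff_s

        # Retries remain: restart quietly and double the wait for next time.
        if self.restart_cnt < self.max_restart_cnt:
            delay = self.backoff_s
            self.backoff_s *= 2
            self.restart_cnt += 1
            return Decision(restart_after_s=delay, message=None)

        # Out of retries: stay silent for missing deps, otherwise report once.
        if missing_deps:
            return Decision(restart_after_s=None, message=None)
        return Decision(
            restart_after_s=None,
            message=(f'Dashboard agent at IP="{ip}" died. You might lose metrics '
                     f"and information from the dashboard. Check logs {log_path} "
                     "to see the root cause of failure."),
        )
```

The caller would sleep for `restart_after_s` before respawning the agent, and stop once it receives `None`.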

edoakes · 2021-08-20T15:39:37Z

@rkooo567 I agree that would probably be improved logic for the restarting, but I think the best solution here is to just avoid starting the agent when we know it will fail (it's a self-inflicted problem...).

@fyrestone @richardliaw I updated the PR to standardize how we do this for the dashboard and agent, please have another look.

rkooo567 · 2021-08-20T18:21:07Z

Sure. I think we can just do better restarting / error message in other PRs. (it seems to be important for usability). I can create an issue for that

edoakes assigned rkooo567 Aug 19, 2021

edoakes assigned fyrestone Aug 19, 2021

rkooo567 approved these changes Aug 19, 2021

WIP

b663e11

edoakes force-pushed the fix-dashboard-agent-restart branch from c140382 to b663e11 Compare August 19, 2021 21:51

edoakes assigned richardliaw Aug 19, 2021

edoakes changed the title ~~[WIP] Don't restart agent when exiting due to missing depedencies~~ Don't start dashboard agent when missing depedencies Aug 19, 2021

edoakes changed the title ~~Don't start dashboard agent when missing depedencies~~ Don't start dashboard agent when missing dependencies Aug 19, 2021

richardliaw reviewed Aug 20, 2021

python/ray/_private/services.py (outdated)

fyrestone reviewed Aug 20, 2021

python/ray/_private/services.py (outdated)

edoakes added 2 commits August 20, 2021 10:10

Merge branch 'master' of https://github.com/ray-project/ray into fix-…

c781543

…dashboard-agent-restart

move deps to their own file

ed09cd9

edoakes added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") Aug 20, 2021

richardliaw merged commit b969aa3 into ray-project:master Aug 21, 2021

richardliaw pushed a commit that referenced this pull request Aug 21, 2021

[dashboard] Don't start dashboard agent when missing dependencies (#1…

7916500

…7966)
