Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't start dashboard agent when missing dependencies #17966

Merged
merged 3 commits into from
Aug 21, 2021

Conversation

edoakes
Copy link
Contributor

@edoakes edoakes commented Aug 19, 2021

Why are these changes needed?

Avoids spawning the agent altogether if the relevant dependencies are missing. We use the presence of some required optional pip packages for the check here, so it's important to keep them in sync between the dashboard code and the check to start it. This is achieved by moving optional dependencies to their own submodule within the dashboard, which is imported both by the code that decides whether or not to start the dashboard and the code in the dashboard that actually uses the packages.

Related issue number

Closes #17965

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@edoakes
Copy link
Contributor Author

edoakes commented Aug 19, 2021

@richardliaw do you have any pointers to tests we have for the minimal vs default distribution? Not sure if there's an existing way to test the behavior with a specific set of deps.

@edoakes
Copy link
Contributor Author

edoakes commented Aug 19, 2021

Ok, it looks like our exit code detection doesn't actually work :( this always returns 0 because it's using some file descriptor magic to determine if the process died...

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice! The fix is simple :)

Nvm.. just saw your message...

@rkooo567
Copy link
Contributor

Maybe we can have some retry count with exponential backoff? Like

10ms -> 20ms -> 40ms -> 80ms -> 2s -> 10s -> stop restarting

@edoakes
Copy link
Contributor Author

edoakes commented Aug 19, 2021

Yeah @rkooo567 I was thinking something similar... we can try restarting it some number of times and if it never succeeds then we stop restarting all together. Another option would be to just continue restarting at some long interval forever.

@edoakes
Copy link
Contributor Author

edoakes commented Aug 19, 2021

We may also be able to just avoid spawning the agent altogether if we don't have the right deps... let me see how hard that would be

@edoakes edoakes force-pushed the fix-dashboard-agent-restart branch from c140382 to b663e11 Compare August 19, 2021 21:51
@edoakes edoakes changed the title [WIP] Don't restart agent when exiting due to missing depedencies Don't start dashboard agent when missing depedencies Aug 19, 2021
@edoakes edoakes changed the title Don't start dashboard agent when missing depedencies Don't start dashboard agent when missing dependencies Aug 19, 2021
@rkooo567
Copy link
Contributor

rkooo567 commented Aug 20, 2021

@edoakes I actually feel like this approach can be pretty nice and provide clean error messages.

Maybe we can have some retry count with exponential backoff? Like

10ms -> 20ms -> 40ms -> 80ms -> 2s -> 10s -> stop restarting

I think this is something we can do;

  1. When the raylet starts a dashboard, it sets the restart_cnt = 0 and backoff_interval=100ms
  2. Agent starts. If it fails with restart_cnt != max_restart_cnt, don't publish any error log.
  3. Raylet restarts the agent and set backoff_internval *= 2 & restart_cnt += 1
  4. Repeat until restart_cnt == max_restart_cnt
  5. If agent is alive for more than X seconds, reset restart_cnt and backoff_interval
  6. If agent dies and restart_cnt == max_restart_cnt, do the following;
  • If the agent fails with an exception, publish a short message; Dashboard agent at IP="" died. You might lose metrics and information from the dashboard. Check logs X to see the root cause of failure.
    • real issue case, but we can provide a clean logs with less verbose error messages)
  • If the agent fails with the dependency error, don't print any error.
    • no dep case

@edoakes
Copy link
Contributor Author

edoakes commented Aug 20, 2021

@rkooo567 I agree that would probably be improved logic for the restarting, but I think the best solution here is to just avoid starting the agent when we know it will fail (it's a self-inflicted problem...).

@fyrestone @richardliaw I updated the PR to standardize how we do this for the dashboard and agent, please have another look.

@rkooo567
Copy link
Contributor

Sure. I think we can just do better restarting / error message in other PRs. (it seems to be important for usability). I can create an issue for that

@edoakes edoakes added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 20, 2021
@richardliaw richardliaw merged commit b969aa3 into ray-project:master Aug 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
4 participants