
[Dashboard] Start the new dashboard #9860

Merged — 28 commits from dashboard_switch into ray-project:master, Aug 13, 2020

Conversation

@fyrestone (Contributor) commented Aug 1, 2020

This PR contains the following features:

  • Setting the RAY_USE_NEW_DASHBOARD environment variable before ray.init(include_dashboard=True) will start the new dashboard (see the usage sketch below).
  • The raylet forks a new dashboard agent if the RAY_USE_NEW_DASHBOARD environment variable exists.
  • The new dashboard head starts a gRPC server (all head modules can extend the gRPC server).
  • The new dashboard agent starts an HTTP server (all agent modules can extend the HTTP server).
  • Add basic tests for the new dashboard.
  • Add ClassMethodRouteTable for head and agent HTTP routes.
  • Port features from reporter.py to the reporter module.
  • Some other improvements.
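A minimal usage sketch of the opt-in described in the first bullet, assuming a Ray build that includes this PR. Per the PR description, the variable only needs to exist (any value) before ray.init() for the new dashboard to start.

```python
import os

# The PR checks for the variable's existence, so any value works.
os.environ["RAY_USE_NEW_DASHBOARD"] = "1"

import ray

ray.init(include_dashboard=True)
```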

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins
Can one of the admins verify this patch?

@fyrestone fyrestone changed the title [WIP][Dashboard] Start new dashboard [Dashboard] Start new dashboard Aug 1, 2020
@fyrestone fyrestone changed the title [Dashboard] Start new dashboard [Dashboard] Start the new dashboard Aug 1, 2020
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29276/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29277/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29281/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29282/

@rkooo567 rkooo567 self-assigned this Aug 3, 2020
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29424/

@rkooo567 (Contributor) left a comment:

LGTM in general. I have several questions though.

  1. It is really lacking tests. We need a lot of tests to make sure it works properly. What's your test plan?
  2. Why does each agent run an HTTP server? It seems unnecessary.
  3. Also, why does the head dashboard have gRPC endpoints? (See the sketch below for the feature this refers to.)
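For context on question 3: a minimal grpc.aio sketch of the extension point from the PR summary, where every head module registers its servicer on one shared server. HeadModule and the servicer registration shown are hypothetical stand-ins, not the PR's actual names.

```python
import asyncio
from typing import List

import grpc


class HeadModule:
    """Hypothetical module interface; the PR's real names differ."""

    def register(self, server: grpc.aio.Server) -> None:
        # e.g. my_pb2_grpc.add_MyServicer_to_server(self, server)
        raise NotImplementedError


async def run_head(modules: List[HeadModule]) -> None:
    # One shared grpc.aio server; each head module extends it with its
    # own servicer instead of opening its own port.
    server = grpc.aio.server()
    for module in modules:
        module.register(server)
    port = server.add_insecure_port("[::]:0")  # 0 = pick any free port
    await server.start()
    print(f"Dashboard head gRPC server listening on port {port}")
    await server.wait_for_termination()
```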

dashboard/agent.py (review thread resolved)
dashboard/agent.py (review thread outdated, resolved)
app.add_routes(routes=routes.bound_routes())

# Enable CORS on all routes.
cors = aiohttp_cors.setup(
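The excerpt above registers routes via routes.bound_routes(). A minimal sketch of the ClassMethodRouteTable idea from the feature list, assuming its purpose is to let aiohttp handlers be declared on class methods and bound to a module instance later (illustrative only, not the PR's implementation):

```python
from aiohttp import web


class ClassMethodRouteTable:
    """Records unbound class-method handlers; binds them to an instance."""

    _routes = []  # (http_method, path, unbound handler)

    @classmethod
    def get(cls, path):
        def decorator(func):
            cls._routes.append(("GET", path, func))
            return func
        return decorator

    @classmethod
    def bound_routes(cls, instance):
        cls_name = type(instance).__name__
        bound = []
        for http_method, path, func in cls._routes:
            # Only bind handlers declared on this instance's class.
            if func.__qualname__.startswith(cls_name + "."):
                async def handler(request, _func=func, _self=instance):
                    return await _func(_self, request)
                bound.append(web.route(http_method, path, handler))
        return bound


class ReporterHead:  # hypothetical dashboard head module
    @ClassMethodRouteTable.get("/api/reporter/usage")
    async def usage(self, request):
        return web.json_response({"ok": True})


app = web.Application()
app.add_routes(ClassMethodRouteTable.bound_routes(ReporterHead()))
```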
Contributor:
Previously, we disabled cors when it was dev mode because frontend usually wants to start a react server. We aren't doing that anymore?

Contributor Author:
Currently, CORS is enabled for the new log module. Each agent starts an HTTP server for log-file access, which allows the new dashboard frontend to fetch log files directly as static resources.

Contributor:
Hmm, so we allow access to all methods on all routes of the app. Isn't this a security risk?

Contributor Author:
The dashboard has the same access level as the agent. We can access all routes of the dashboard, so why disallow access to all routes of the agent?

The log HTTP server on each agent can reduce the load on the dashboard.
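For context on this thread, a hedged sketch of the standard aiohttp_cors pattern that the truncated excerpt above begins. The endpoint, path, and ResourceOptions values are illustrative, not necessarily what the PR ships.

```python
import aiohttp_cors
from aiohttp import web


async def list_logs(request: web.Request) -> web.Response:
    # Hypothetical agent endpoint serving log metadata.
    return web.json_response(["raylet.out", "raylet.err"])


app = web.Application()
app.router.add_get("/logs/index", list_logs)

# Enable CORS on all routes so the dashboard frontend (a different
# origin) can fetch log files straight from each agent's HTTP server.
cors = aiohttp_cors.setup(
    app,
    defaults={
        "*": aiohttp_cors.ResourceOptions(
            allow_credentials=True,
            expose_headers="*",
            allow_headers="*",
        )
    },
)
for route in list(app.router.routes()):
    cors.add(route)
```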

logger.info("Registered %s routes.", len(dump_routes))

# Write the dashboard agent port to redis.
await self.aioredis_client.set(
Contributor:
We should remove this from Redis when agents are shut down. One of the most common workflows for Ray is autoscaling clusters (and we will invest a lot of resources in this), which means agents will occasionally be killed and removed from the cluster.

Contributor Author:
The agent will be restarted after being killed, and the new value will be written to Redis. So, can we leave the entry in Redis when agents shut down?

Contributor:
If a node is removed from the cluster, will the redis entry for the agent from that node be cleaned up?

@fyrestone (Contributor Author) commented Aug 6, 2020:
No. If a node is removed from the cluster, the entry for its agent will not be cleaned up. The dashboard uses the node IP to access the agent entry, so a dead agent entry is not a problem. If we really need to clean up dead agent entries, we can:

  • Clean up the agent entry via GCS (the node itself can't guarantee the cleanup).
  • Set a TTL on the agent entry and have the agent refresh the TTL periodically (sketched below).
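A hedged sketch of the TTL option above (not the merged code). It assumes an aioredis 1.x-style client like the self.aioredis_client in this PR's excerpt; the key format, value, and intervals are illustrative.

```python
import asyncio


async def refresh_agent_entry(aioredis_client, node_ip, agent_port,
                              ttl_seconds=30):
    key = f"DASHBOARD_AGENT_PORT:{node_ip}"  # hypothetical key format
    while True:
        # Re-set the entry with a TTL so it expires on its own if the
        # agent dies, instead of lingering after the node leaves the
        # cluster.
        await aioredis_client.set(key, str(agent_port), expire=ttl_seconds)
        await asyncio.sleep(ttl_seconds / 3)
```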

Contributor:
I think this is not a big deal for now if the only concern is having unused entries in GCS.

dashboard/head.py (review thread outdated, resolved)
python/setup.py (review thread outdated, resolved)
python/setup.py (review thread outdated, resolved)
src/ray/gcs/gcs_server/gcs_actor_scheduler.cc (review thread resolved)
src/ray/raylet/main.cc (review thread outdated, resolved)

RAY_LOG(WARNING) << "Agent process with pid " << child.GetId()
<< " exit, return value " << exit_code;
RAY_UNUSED(delay_executor_([this] { StartAgent(); },
Contributor:
Should we kill this thread in this case?

Contributor Author:
The monitor_thread has been detached, so the thread will exit after this lambda returns.

Contributor:
Ah, that makes sense. I was confused.

Co-authored-by: SangBin Cho <[email protected]>
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29429/

Co-authored-by: SangBin Cho <[email protected]>
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29444/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29501/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29543/

@@ -395,7 +395,7 @@ int Process::Wait() const {
   return status;
 }

-void Process::Kill() {
+void Process::Kill() const {
Contributor:
Maybe there are some risks in Process; I don't see any unit tests for it.
To be honest, I don't think it is necessary to write such a Process class, because boost.process has already done all of this. I recommend that we use boost.process instead of the current Process.
@mehrdadn @rkooo567 @mfitton

Contributor:
Actually Boost.Process is what we tried before this. It turned out that Boost.Process was quite unstable and had bugs (and slowed down compilation dramatically too), and after struggling with it for over 1 month, we had to scrap it and roll our own solution, which works fine. Please don't revert back to it; it is very likely to cause problems again.

@fyrestone Regarding the const, this shouldn't be const. What was the reason for adding const here?

@qicosmos (Contributor) commented Aug 7, 2020:
> Actually Boost.Process is what we tried before this. It turned out that Boost.Process was quite unstable and had bugs (and slowed down compilation dramatically too), and after struggling with it for over 1 month, we had to scrap it and roll our own solution, which works fine. Please don't revert back to it; it is very likely to cause problems again.
>
> @fyrestone Regarding the const, this shouldn't be const. What was the reason for adding const here?

Got it. It would be useful to add some interfaces to Process, such as WaitFor and WaitUntil.
Another suggestion: Process should forbid the copy constructor and copy assignment; move-only would be better.
I hope you can add unit tests for Process, especially exception tests.

@mehrdadn (Contributor) commented Aug 7, 2020:
These are not trivial changes to make correctly. I don't think we should alter this specific class without compelling reasons.

The entire design of the API here was honestly quite deliberate. For example, I specifically allowed copying because the initial code used pid_t everywhere, and this ensured semantic equivalence and minimized the chances of bugs being introduced. Requiring unique ownership would require a significant refactor of the rest of the codebase to accommodate it. If someone feels compelled to do it and can do it without introducing bugs, I suppose they can go ahead, but it is a significant effort and in the end there seems to be very little to gain.

Note that the primary intention for this class is to provide an easy-to-use interface that works correctly across all operating systems. While there are certain features missing that could be added and potentially come in handy, the limited interface here was primarily designed to satisfy our actual needs without adding unnecessary maintenance burden. Process management is possibly the biggest source of pain when porting code across operating systems, and every change would have to be shown to work correctly on all 3 platforms with comparable semantics.

Overall, this is one of the few classes where it's not a good idea to change it just for the sake of elegance. If there is functionality we actually need, then perhaps that warrants adding it. But as long as it is working correctly and other solutions can reasonably be implemented on top, it should probably be left alone.

Contributor Author:
> Actually Boost.Process is what we tried before this. It turned out that Boost.Process was quite unstable and had bugs (and slowed down compilation dramatically too), and after struggling with it for over 1 month, we had to scrap it and roll our own solution, which works fine. Please don't revert back to it; it is very likely to cause problems again.
>
> @fyrestone Regarding the const, this shouldn't be const. What was the reason for adding const here?

I have removed const from Process::Kill().

src/ray/util/asio_util.h (review thread resolved)
src/ray/raylet/agent_manager.cc (review thread resolved)
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29557/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29603/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29645/

@rkooo567 (Contributor):
Please make sure to pass all CIs. Also, please don't merge the PR until @mfitton approves!

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/29710/

@mfitton (Contributor) commented Aug 11, 2020:

Approved, but as Sang said, let's be careful not to break CI when merging.

@raulchen raulchen merged commit 739933e into ray-project:master Aug 13, 2020
@raulchen raulchen deleted the dashboard_switch branch August 13, 2020 03:01
robertnishihara added a commit to robertnishihara/ray that referenced this pull request Aug 14, 2020
rkooo567 pushed a commit that referenced this pull request Aug 14, 2020
@fyrestone fyrestone restored the dashboard_switch branch August 15, 2020 09:10
@fyrestone fyrestone mentioned this pull request Jan 7, 2021