-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Dashboard] Start the new dashboard #9860
Conversation
…w dashboard startup
Can one of the admins verify this patch? |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general. I have several questions though.
- It is really lacking tests. We need A LOT of tests to make sure it works properly. What's you guys test plan?
- What's the reason why each agents run http servers? It seems to be unncessary.
- Also, what's the reason why the head dashboard has grpc endpoints?
app.add_routes(routes=routes.bound_routes()) | ||
|
||
# Enable CORS on all routes. | ||
cors = aiohttp_cors.setup( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, we disabled cors when it was dev mode because frontend usually wants to start a react server. We aren't doing that anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the CORS
is for the new log
module. Each agent starts a http server for log file access. Allow new dashboard frontend to access log file as static resource directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so we allow access to all methods on all routes of the app. Isn't this a security risk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dashboard
has the same access level with agent
. We can access all routes of dashboard
, why disallow access to all routes of agent
?
The log
http server on each agent
can reduce the load of the dashboard
.
logger.info("Registered %s routes.", len(dump_routes)) | ||
|
||
# Write the dashboard agent port to redis. | ||
await self.aioredis_client.set( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove this from Redis when agents are shutdown. One of the most common workflow of Ray is using autoscaling clusters (and we will invest lots of resources on this), and this means agents will be killed occasionally from the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The agent will be restarted after killed, and the new value will be set to Redis. So, can we leave this from Redis when agents are shutdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a node is removed from the cluster, will the redis entry for the agent from that node be cleaned up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, If a node is removed from the cluster, the entry for the agent
will not be cleaned up. Dashboard
use the node ip
to access agent entry, so the dead agent entry is not a problem. If we really need to cleanup the dead agent entry, we can:
- cleanup the agent entry by GCS. (node can't ensure the cleanup)
- set TTL for the agent entry, agent will update the TTL periodically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is not a big deal for now if the only concern is having unused entries to GCS.
|
||
RAY_LOG(WARNING) << "Agent process with pid " << child.GetId() | ||
<< " exit, return value " << exit_code; | ||
RAY_UNUSED(delay_executor_([this] { StartAgent(); }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should kill this thread in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The monitor_thread
has been detached. So, the thread will be exit after this lambda returns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that makes sense. I was confused.
Co-authored-by: SangBin Cho <[email protected]>
Test FAILed. |
Co-authored-by: SangBin Cho <[email protected]>
Test FAILed. |
Test FAILed. |
Test FAILed. |
src/ray/util/process.cc
Outdated
@@ -395,7 +395,7 @@ int Process::Wait() const { | |||
return status; | |||
} | |||
|
|||
void Process::Kill() { | |||
void Process::Kill() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there are some risks of the Process, i don't see any unit test for the Process.
To be honest i don't think it is necessary to write such a Process, because boost.process has already done all the things, i recommend that we use boost.process instead of current Process.
@mehrdadn @rkooo567 @mfitton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually Boost.Process is what we tried before this. It turned out that Boost.Process was quite unstable and had bugs (and slowed down compilation dramatically too), and after struggling with it for over 1 month, we had to scrap it and roll our own solution, which works fine. Please don't revert back to it; it is very likely to cause problems again.
@fyrestone Regarding the const
, this shouldn't be const
. What was the reason for adding const
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually Boost.Process is what we tried before this. It turned out that Boost.Process was quite unstable and had bugs (and slowed down compilation dramatically too), and after struggling with it for over 1 month, we had to scrap it and roll our own solution, which works fine. Please don't revert back to it; it is very likely to cause problems again.
@fyrestone Regarding the
const
, this shouldn't beconst
. What was the reason for addingconst
here?
Got it. It's necessary to add some useful interface for Process: WaitFor, WaitUntil.
Another suggestion, Process should forbidden copy construct and copy assign function, just move only will be better.
Hope you could add unit tests for Process, especially the exception tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not trivial changes to make correctly. I don't think we should alter this specific class without compelling reasons.
The entire design of the API here was honestly quite deliberate. For example, I specifically allowed copying because the initial code used pid_t
everywhere and this ensured semantic equivalence and minimized the changes of bugs being introduced. Requiring unique ownership would require a significant refactor of the rest of the codebase to accommodate it. If someone feels compelled to do it and can do it without introducing bugs, I suppose they can go ahead, but it is a significant effort and in the end there seems to be very little to gain.
Note that the primary intention for this class is to provide an easy-to-use interface that works correctly across all operating systems. While there are certain features missing that could be added and potentially come in handy, the limited interface here was primarily designed to satisfy our actual needs without adding unnecessary maintenance burden. Process management is possibly the biggest source of pain when porting code across operating systems, and every change would have to be shown to work correctly on all 3 platforms with comparable semantics.
Overall, this is one of the few classes classes where it's not a good idea to change it just for the sake of elegance. If there is a functionality we actually need, then perhaps that warrants adding it. But as long as it is working correctly and other solutions can be reasonably implemented on top, it should probably be left alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually Boost.Process is what we tried before this. It turned out that Boost.Process was quite unstable and had bugs (and slowed down compilation dramatically too), and after struggling with it for over 1 month, we had to scrap it and roll our own solution, which works fine. Please don't revert back to it; it is very likely to cause problems again.
@fyrestone Regarding the
const
, this shouldn't beconst
. What was the reason for addingconst
here?
I have removed const from Process::Kill().
Test FAILed. |
Test FAILed. |
Test FAILed. |
Please make sure to pass all CIs. Also, please don't merge the PR until @mfitton approves! |
Test FAILed. |
Approved, but as Sang said, let's be careful not to break CI when merging. |
This reverts commit 739933e.
This PR contains the following features:
RAY_USE_NEW_DASHBOARD
environment var beforeray.init(include_dashboard=True)
will stat the new dashboard.RAY_USE_NEW_DASHBOARD
environment var exists.ClassMethodRouteTable
for head and agent http routes.Why are these changes needed?
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.