[Dashboard] Dashboard basic modules #10303
Conversation
It LGTM in general (and nice addition of tests :)), but since the PR is really large, I will do another batch of review later. Next time you submit a PR, please break it down (this one could be broken into 3 PRs, imo).
@@ -74,6 +76,22 @@ async def _update_nodes(self):
        while True:
            try:
                nodes = await self._get_nodes()

                # Get correct node info by state,
                # 1. The node is ALIVE if any ALIVE node info
Can you explain the reason in the comment too?
GetAllNodeInfo returns all nodes' info, including dead nodes. We need to know how to merge them by hostname. For example, suppose there are two node infos:
- {node id: 1, hostname: example}
- {node id: 2, hostname: example}
Which one do we choose for hostname example?
For one hostname, there will be a list of node infos with only two cases:
- All node infos of one hostname are DEAD
- Only one node info of one hostname is ALIVE
So, here is the rule:
- Choose a DEAD one if all node infos of one hostname are DEAD.
- Choose the ALIVE one if there is an ALIVE node info for one hostname.
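The rule above can be sketched as a small helper. This is a minimal sketch only: the dict fields `node_id`, `hostname`, and `state` are illustrative stand-ins, not the actual GcsNodeInfo schema.

```python
from typing import Dict, List

ALIVE, DEAD = "ALIVE", "DEAD"

def pick_node_info(infos_for_host: List[dict]) -> dict:
    """Pick the representative info for one hostname.

    - If any entry is ALIVE, choose the ALIVE one (at most one exists).
    - Otherwise all entries are DEAD, so pick the latest DEAD one.
    """
    for info in infos_for_host:
        if info["state"] == ALIVE:
            return info
    return infos_for_host[-1]

def merge_by_hostname(all_infos: List[dict]) -> Dict[str, dict]:
    """Group node infos by hostname and keep one entry per hostname."""
    by_host: Dict[str, List[dict]] = {}
    for info in all_infos:
        by_host.setdefault(info["hostname"], []).append(info)
    return {host: pick_node_info(infos) for host, infos in by_host.items()}
```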
It's better to add a timestamp to GcsNodeInfo.
Hmm.. I see. So this is due to the case where there could be multiple raylets on a single host. I think this should be the case only when we use cluster_utils, right? Or do you know any other case? I think it kind of doesn't make sense to run ray start multiple times on a single node.
If this is useful only for the cluster utils, why don't we just group by hostname + node_id and display both in the dashboard? Is there any issue with this?
If a node fails over many times, there will be a lot of node info entries belonging to one hostname.
I think we can easily hide the second option's downside by filtering DEAD node ids out of user-facing APIs.
{
  node_id: {
    ip:
    host:
    state:
  }
}
And in the frontend, we can just filter out DEAD-state node ids when there are duplicate hostname + ip pairs.
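A rough sketch of that filtering step (written in Python for consistency with the rest of the discussion, though it would live in the front-end; the `host`, `ip`, and `state` field names are illustrative assumptions):

```python
def filter_for_display(nodes_by_id: dict) -> dict:
    """Drop DEAD entries whose (host, ip) pair also has an ALIVE entry."""
    alive_pairs = {
        (n["host"], n["ip"])
        for n in nodes_by_id.values()
        if n["state"] == "ALIVE"
    }
    return {
        node_id: n
        for node_id, n in nodes_by_id.items()
        if n["state"] == "ALIVE" or (n["host"], n["ip"]) not in alive_pairs
    }
```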
@rkooo567 I see. Using node id as the key of node info will return the full node info to the front-end, and the front-end does the filtering. My concern is that the data will be too large if we run many nodes in one cluster. For a DEAD node, the physical node info is useless. If we choose node id as the key, we need some strategies to reduce the data size:
- Only keep GcsNodeInfo data for the DEAD nodes.
- Each (IP + hostname) keeps a limited number of DEAD nodes. For example, the last 5 DEAD nodes.
I see your concerns (since you guys are running long-running clusters, it can have lots of information). I like the second solution, but we can allow users to configure them. So, for regular usage, it is 0~2, and for cluster_utils cases, it can be like 10.
Only contains GcsNodeInfo data for the DEAD nodes.
For this, I am probably not familiar with how the data is represented now (because I thought this was the current behavior). Can you explain a little bit more?
The current node info is a mixture of node physical stats (from the reporter agent) and node stats (from the GetNodeStats rpc). If a node is dead, the node physical stats and node stats will be unreliable; only GcsNodeInfo is correct.
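So one way to combine the three sources would be roughly the following. This is a sketch under stated assumptions: the field names and the merge shape are illustrative, not the dashboard's actual code.

```python
def build_node_entry(gcs_info: dict,
                     physical_stats: dict = None,
                     node_stats: dict = None) -> dict:
    """Merge the three sources, trusting only GcsNodeInfo for dead nodes."""
    entry = dict(gcs_info)  # GcsNodeInfo is reliable in both states
    if gcs_info["state"] == "ALIVE":
        # Reporter-agent and GetNodeStats data are only valid while alive.
        entry["physicalStats"] = physical_stats or {}
        entry["nodeStats"] = node_stats or {}
    return entry
```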
Overall, this looks really good. I had a couple of questions and a couple of other pieces of minor feedback, but hopefully it'll be quick to address. Thanks for all your hard work on this @fyrestone!
Happy to approve once you address or reply to my comments.
@@ -74,6 +76,22 @@ async def _update_nodes(self):
        while True:
            try:
                nodes = await self._get_nodes()

                # Get correct node info by state,
                # 1. The node is ALIVE if any ALIVE node info
@fyrestone I think it could be possible that two machines have the same hostname even if one of them isn't dead. @architkulkarni is currently working on a bug fix for this case in the old dashboard repo.
@@ -74,6 +76,22 @@ async def _update_nodes(self):
        while True:
            try:
                nodes = await self._get_nodes()

                # Get correct node info by state,
                # 1. The node is ALIVE if any ALIVE node info
Do you think it would make sense to use something like an (IP, hostname) pair as a key to uniquely describe a host?
I think the only concern I have is how we will group nodes. I prefer to make it support cluster_utils because we will have more distributed data collection we'd like to test in the future (for example, we will remove all GetCoreWorkerStats and replace it with our metrics infra; this is hard to test without cluster_utils).
But I think this is not a hard blocker for this PR (though I think this should be the next dashboard task).
Maybe another solution is to just refactor our node.py to contain the hostname. That way, we can inject a fake hostname for cluster_utils.
I agree with you. I can make a PR after we find a solution.
Looks good. It seems to me like the test failures are unrelated, and I agree with Sang that the PR shouldn't be blocked by the node grouping problem.
@fyrestone Please resolve the merge conflict and tag
I have resolved the conflicts, and set the
Thanks again for this PR @fyrestone! It looks really clean, and I am so happy about the testing status. Would you mind pushing a PR for the node grouping next (if you are planning to work on that)?
Thanks. I will create the node grouping PR this week. The
This PR includes 3 basic modules for the new dashboard:
- reporter.py: the profiling RESTful API has been simplified to one GET /api/launch_profiling.
- Log APIs: GET /log_index and GET /logs from the dashboard, and GET /logs from the dashboard agent.
- Node APIs: GET /nodes?view=[summary|hostNameList] and GET /nodes/{hostname}.
Related issue number

Checks
- I've run scripts/format.sh to lint the changes in this PR.