
[Dashboard] Dashboard basic modules #10303

Merged
merged 23 commits into from
Aug 30, 2020

Conversation


@fyrestone fyrestone commented Aug 25, 2020

This PR includes 3 basic modules for the new dashboard:

  • reporter - Forks reporter.py; the profiling RESTful API has been simplified to a single GET /api/launch_profiling.
  • log - Basic log module; provides GET /log_index and GET /logs from the dashboard, and GET /logs from the dashboard agent.
  • stats_collector - Collects stats (currently, actors and node_stats); provides GET /nodes?view=[summary|hostNameList] and GET /nodes/{hostname}.
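As a rough illustration of the stats_collector query shapes, here is a minimal sketch of the view dispatch for GET /nodes (the function name, the in-memory node_table shape, and the summary fields are assumptions for illustration, not the actual Ray dashboard code):

```python
# Hypothetical sketch of GET /nodes?view=[summary|hostNameList] dispatch.
# node_table maps node id -> node info dict; shapes are illustrative only.

def get_nodes(node_table, view=None):
    if view == "hostNameList":
        # GET /nodes?view=hostNameList: just the distinct hostnames.
        return sorted({n["hostname"] for n in node_table.values()})
    if view == "summary":
        # GET /nodes?view=summary: a trimmed-down entry per node.
        return [{"hostname": n["hostname"], "state": n["state"]}
                for n in node_table.values()]
    raise ValueError("unknown view: %r" % (view,))
```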

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@fyrestone fyrestone added the dashboard Issues specific to the Ray Dashboard label Aug 25, 2020
@fyrestone fyrestone added this to the Turn on new dashboard milestone Aug 25, 2020
@fyrestone fyrestone changed the title [WIP][Dashboard] Dashboard basic modules [Dashboard] Dashboard basic modules Aug 25, 2020

@rkooo567 rkooo567 left a comment


It LGTM in general (and nice addition of tests :)), but since the PR is really large, I will have another batch of review later. Next time you submit a PR, please break it down (this one could be split into 3 PRs imo).

@@ -74,6 +76,22 @@ async def _update_nodes(self):
while True:
try:
nodes = await self._get_nodes()

# Get correct node info by state,
# 1. The node is ALIVE if any ALIVE node info
Contributor

Can you explain the reason in the comment too?

@fyrestone fyrestone Aug 26, 2020

GetAllNodeInfo returns info for all nodes, including dead ones. We need to know how to merge them by hostname. For example, suppose there are two node info entries:

  • {node id: 1, hostname: example}
  • {node id: 2, hostname: example}

Which one should we choose for host example?

For one hostname, there will be a list of node info entries with only two possible cases:

  • All node info entries for the hostname are DEAD
  • Exactly one node info entry for the hostname is ALIVE

So, here is the rule:

  • Choose a DEAD one if all node info entries for the hostname are DEAD.
  • Choose the ALIVE one if the hostname has an ALIVE node.
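The rule can be sketched as a small helper (illustrative only; this function and its dict shape are not code from the PR):

```python
# Pick the representative node info entry for one hostname.
# Assumes each entry is a dict with at least a "state" key.

def merge_node_infos(infos):
    alive = [n for n in infos if n["state"] == "ALIVE"]
    if alive:
        # At most one entry per hostname can be ALIVE, per the rule above.
        return alive[0]
    # All entries are DEAD: any of them represents the host; take the last.
    return infos[-1]
```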

Contributor Author

It's better to add a timestamp to GcsNodeInfo.

Contributor

Hmm.. I see. So this is due to the case where there could be multiple raylets on a single host. I think this should only be the case when we use cluster_utils, right? Or do you know any other case? It kind of doesn't make sense to have multiple ray start invocations on a single node.

If this is useful only for cluster_utils, why don't we just group by hostname + node_id and display both in the dashboard? Is there any issue with this?

Contributor Author

If a node fails over many times, there will be a lot of node info entries belonging to one hostname.

@rkooo567 rkooo567 Aug 27, 2020

I think we can easily hide the second option's downside by filtering DEAD node IDs out of the user-facing APIs.

@rkooo567 rkooo567 Aug 27, 2020

{
  node_id: {
    ip:
    host:
    state:
  }
}

And in the frontend, we can just filter out DEAD-state node ids when there are entries with the same hostname + ip pair.

Contributor Author

@rkooo567 I see. Using node id as the key means the full node info for every node is returned to the front-end, and the front-end does the filtering. My concern is that the data will be too large if we run many nodes in one cluster; for a DEAD node, the physical node info is useless. If we choose node id as the key, we need some strategies to reduce the data size:

  • Only include the GcsNodeInfo data for DEAD nodes.
  • Each (IP + hostname) pair retains a limited number of DEAD nodes, for example the last 5.
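The second strategy could look roughly like this bounded per-host cache (the class and field names are hypothetical, not part of this PR):

```python
from collections import defaultdict, deque

class DeadNodeCache:
    """Keep only the last `limit` DEAD node ids per (IP, hostname) pair."""

    def __init__(self, limit=5):
        # deque(maxlen=limit) silently drops the oldest entry when full.
        self._dead = defaultdict(lambda: deque(maxlen=limit))

    def add(self, node):
        self._dead[(node["ip"], node["hostname"])].append(node["node_id"])

    def dead_ids(self, ip, hostname):
        return list(self._dead[(ip, hostname)])
```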

@rkooo567 rkooo567 Aug 27, 2020

I see your concerns (since you guys are running long-running clusters, it can accumulate lots of information). I like the second solution, but we can allow users to configure the limit. So, for regular usage, it is 0~2, and for cluster_utils cases, it can be like 10.

Only contains GcsNodeInfo data for the DEAD nodes.

For this, I am probably not familiar with how the data is represented now (because I thought this was already the current behavior). Can you explain a little bit more?

Contributor Author

The current node info is a mixture of node physical stats (from the reporter agent) and node stats (from the GetNodeStats RPC). If a node is dead, its physical stats and node stats are unreliable; only the GcsNodeInfo is correct.
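To illustrate, trimming a DEAD node's entry down to its GcsNodeInfo fields might look like this (the field names and the entry structure are illustrative assumptions about the data shape, not the PR's actual structures):

```python
# Illustrative subset of GcsNodeInfo fields; not the real proto definition.
GCS_NODE_INFO_FIELDS = ("node_id", "hostname", "ip", "state")

def node_entry(gcs_info, physical_stats=None, node_stats=None):
    # Always keep the GcsNodeInfo fields; they stay valid after node death.
    entry = {k: gcs_info[k] for k in GCS_NODE_INFO_FIELDS if k in gcs_info}
    if gcs_info.get("state") == "ALIVE":
        # Agent-reported stats are only trustworthy while the node is alive.
        entry["physicalStats"] = physical_stats
        entry["stats"] = node_stats
    return entry
```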

dashboard/modules/log/log_head.py (resolved)
dashboard/modules/log/log_head.py (outdated, resolved)
dashboard/modules/log/test_log.py (outdated, resolved)
dashboard/modules/reporter/reporter_agent.py (resolved)
dashboard/modules/reporter/reporter_head.py (outdated, resolved)
dashboard/modules/reporter/reporter_head.py (outdated, resolved)
dashboard/modules/reporter/test_reporter.py (outdated, resolved)
dashboard/tests/test_dashboard.py (outdated, resolved)
dashboard/tests/test_dashboard.py (outdated, resolved)
dashboard/tests/test_dashboard.py (outdated, resolved)
@mfitton mfitton left a comment

Overall, this looks really good. I had a couple questions, and a couple other pieces of minor feedback, but hopefully it'll be quick to address. Thanks for all your hard work on this @fyrestone
Happy to approve once you address or reply to my comments.

dashboard/agent.py (resolved)
dashboard/dashboard.py (resolved)
dashboard/agent.py (resolved)
@@ -74,6 +76,22 @@ async def _update_nodes(self):
while True:
try:
nodes = await self._get_nodes()

# Get correct node info by state,
# 1. The node is ALIVE if any ALIVE node info
Contributor

@fyrestone I think it could be possible that two machines have the same hostname even if one of them isn't dead. @architkulkarni is currently working on a bug fix for this case in the old dashboard repo.

@@ -74,6 +76,22 @@ async def _update_nodes(self):
while True:
try:
nodes = await self._get_nodes()

# Get correct node info by state,
# 1. The node is ALIVE if any ALIVE node info
Contributor

Do you think it would make sense to use something like an (IP, hostname) pair as a key to uniquely describe a host?

dashboard/modules/stats_collector/stats_collector_head.py (outdated, resolved)
dashboard/tests/test_dashboard.py (outdated, resolved)
dashboard/tests/conftest.py (resolved)
dashboard/utils.py (resolved)
dashboard/utils.py (outdated, resolved)
@mfitton mfitton added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 26, 2020
@rkooo567 rkooo567 left a comment

I think the only concern I have is how we will group nodes. I prefer to make it support cluster_utils because we will have more distributed data collection we'd like to test in the future (for example, we will remove all GetCoreWorkerStats and replace it with our metrics infra; this is hard to test without cluster_utils).

But I think this is not a hard blocker for this PR (though it should probably be the next dashboard task).

@rkooo567

Maybe another solution is to just refactor our node.py to contain the hostname. That way, we can inject a fake hostname for cluster_utils.

@fyrestone

I think the only concern I have is how we will group nodes. I prefer to make it support cluster_utils because we will have more distributed data collection we'd like to test in the future (for example, we will remove all GetCoreWorkerStats and replace it with our metrics infra; this is hard to test without cluster_utils).

But I think this is not a hard blocker for this PR (though it should probably be the next dashboard task).

I agree with you. I can make a PR after we find a solution.

@mfitton mfitton self-requested a review August 28, 2020 18:15
mfitton approved these changes Aug 28, 2020

@mfitton mfitton left a comment

Looks good. It seems to me like the test failures are unrelated, and I agree with Sang that the PR shouldn't be blocked by the node grouping problem.

@rkooo567

@fyrestone Please resolve the merge conflict and tag test_Ok label once build result looks fine. I will merge it after that :)

@fyrestone fyrestone added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 29, 2020
@fyrestone

@fyrestone Please resolve the merge conflict and tag test_Ok label once build result looks fine. I will merge it after that :)

I have resolved the conflicts, and set the test_Ok label. I am waiting for the test results.

@rkooo567

Thanks again for this PR @fyrestone! It looks really clean, and I am so happy about testing status. Would you mind pushing a PR for the node grouping next? (if you are planning to work on that)!

@rkooo567 rkooo567 merged commit e9b0463 into ray-project:master Aug 30, 2020
@fyrestone

Thanks again for this PR @fyrestone! It looks really clean, and I am so happy about testing status. Would you mind pushing a PR for the node grouping next? (if you are planning to work on that)!

Thanks. I will create the node grouping PR this week. GET /nodes will return all node info using node id as the key, and GET /node/{hostname} will be replaced by GET /node/{nodeid}.
