Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UI] [Nodes] Disable actions for dead nodes #34387

Merged
merged 9 commits into from
Apr 25, 2023

Conversation

chaowanggg
Copy link
Contributor

@chaowanggg chaowanggg commented Apr 14, 2023

Why are these changes needed?

We are going to remove logs for dead nodes and remove Stack Trace and CPU Flame Graph for dead workers.
Otherwise, users will click the link and jump to a page without information they need.

Alive Nodes Dead Nodes
image image

Summary of Changes

  1. hide logs for dead nodes in NodeRow.tsx

Discussion and Confusion

As part of the PR, we have also come to an agreement, which I will list here for future reference.

  1. There are three entities on the Cluster page: node, raylet, and worker.
  2. We have a state field for node and raylet, but we do not have the state of the worker returned from the backend API.
  3. In Ray, a node is equal to a raylet.

Related issue number

fixes #32593

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: chaowang <[email protected]>
@chaowanggg chaowanggg force-pushed the cw/disable_dead_nodes_actions branch from 25f2abc to 50aebb8 Compare April 14, 2023 00:15
@chaowanggg
Copy link
Contributor Author

needs more discussion with @scottsun94 about how to deal with dead workers logs

@chaowanggg chaowanggg marked this pull request as ready for review April 14, 2023 00:17
Copy link
Contributor

@alanwguo alanwguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all UI changes, please add a screenshot to the PR description.

dashboard/client/src/pages/node/NodeRow.tsx Outdated Show resolved Hide resolved
CPU&nbsp;Flame&nbsp;Graph
</a>
<br />
{/* )} */}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

dashboard/client/src/pages/node/NodeRow.component.test.tsx Outdated Show resolved Hide resolved
@chaowanggg
Copy link
Contributor Author

chaowanggg commented Apr 18, 2023

@alanwguo I am curious why do we hardcode the status of a worker as "ALIVE"?

For this task, I need to hide CPU profiling & stack trace for DEAD workers and I think I could judge whether it is dead by using "is_alive" prop from the API

<StatusChip type="worker" status="ALIVE" />

@alanwguo
Copy link
Contributor

@alanwguo I am curious why do we hardcode the status of a worker as "ALIVE"?

For this task, I need to hide CPU profiling & stack trace for DEAD workers and I think I could judge whether it is dead by using "is_alive" prop from the API

<StatusChip type="worker" status="ALIVE" />

Hm. I'm not sure. I think at one point we only showed alive Nodes in the list view but that is no longer the case. Please remove the hard-coding.

@chaowanggg
Copy link
Contributor Author

@alanwguo I am curious why do we hardcode the status of a worker as "ALIVE"?
For this task, I need to hide CPU profiling & stack trace for DEAD workers and I think I could judge whether it is dead by using "is_alive" prop from the API

<StatusChip type="worker" status="ALIVE" />

Hm. I'm not sure. I think at one point we only showed alive Nodes in the list view but that is no longer the case. Please remove the hard-coding.

Thanks Alan. I just wanted to double-confirm that the backend logic has filtered the workers and is only returning "ALIVE" workers to the front end. This is because I need to display different UI for workers that are "ALIVE" versus those that are "DEAD".

Could you please let me know whom I should contact to confirm the backend logic, or provide me with a code pointer that I can refer to?

@alanwguo
Copy link
Contributor

Please check get_all_nodes in node_head.py

@chaowanggg
Copy link
Contributor Author

@alanwguo In this PR, I need to demonstrate distinct UIs for dead workers and alive workers. However, I'm unable to determine the state of a worker(not node) using the available API, as it doesn't include a "state" or "is_alive" field

image

@chaowanggg
Copy link
Contributor Author

Please check get_all_nodes in node_head.py

Double confirmed, no filtering in get_all_nodes function

@alanwguo
Copy link
Contributor

@alanwguo In this PR, I need to demonstrate distinct UIs for dead workers and alive workers. However, I'm unable to determine the state of a worker(not node) using the available API, as it doesn't include a "state" or "is_alive" field

image

I think all workers are guaranteed to be alive. We don't show information about dead workers.

Can we apply this change only to nodes for this PR?

@chaowanggg
Copy link
Contributor Author

@scottsun94 All workers are guaranteed to be alive and we don't show information about dead workers.

According to this figma. Do you think we could skip the UI changes for workers?

@scottsun94
Copy link
Contributor

@scottsun94 All workers are guaranteed to be alive and we don't show information about dead workers.

According to this figma. Do you think we could skip the UI changes for workers?

SG. Let's skip it for this PR.

However, is there any reason why we don't expose/have the info about the dead workers? @rkooo567
Let's say if a worker died because of OOMs, and usually the error message asks people to look at the worker logs. If a dead worker is gone, they have to go to the logs UI to find the logs. With the new worker log change, it's really hard for people to find it any more...

@chaowanggg chaowanggg changed the title [UI] [Nodes] Disable actions for dead nodes & dead workers [UI] [Nodes] Disable actions for dead nodes Apr 19, 2023
@chaowanggg chaowanggg marked this pull request as draft April 19, 2023 16:33
@@ -94,7 +95,9 @@ export const NodeRow = ({
</Box>
</TableCell>
<TableCell>
<Link to={`/logs/${encodeURIComponent(logUrl)}`}>Log</Link>
Copy link
Contributor Author

@chaowanggg chaowanggg Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to get the state of a node to decide whether to show logs or hide logs. But I noticed that in line 78, we are using raylet.state.

<TableCell>
        <StatusChip type="node" status={raylet.state} />
</TableCell>

Thus, we have two state here, node's state and raylet's state.

Why do we need to show raylet's state here? Do I need to hide or show the logs based on raylet's state?
@alanwguo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkooo567 can you explain the difference between these states?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node == raylet in Ray

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, we can refactor these API responses from the dashboard at some point..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

@chaowanggg chaowanggg marked this pull request as ready for review April 19, 2023 20:07
@@ -94,7 +95,9 @@ export const NodeRow = ({
</Box>
</TableCell>
<TableCell>
<Link to={`/logs/${encodeURIComponent(logUrl)}`}>Log</Link>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, we can refactor these API responses from the dashboard at some point..

@rkooo567
Copy link
Contributor

oops sorry ignore my aproval ^

Signed-off-by: chaowang <[email protected]>
Signed-off-by: chaowang <[email protected]>
Signed-off-by: chaowang <[email protected]>
Signed-off-by: chaowang <[email protected]>
Signed-off-by: chaowang <[email protected]>
Signed-off-by: chaowang <[email protected]>
Signed-off-by: chaowang <[email protected]>
@rkooo567 rkooo567 merged commit 1ce7303 into ray-project:master Apr 25, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
We are going to remove logs for dead nodes and remove Stack Trace and CPU Flame Graph for dead workers.
Otherwise, users will click the link and jump to a page without information they need.

Signed-off-by: Jack He <[email protected]>
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
We are going to remove logs for dead nodes and remove Stack Trace and CPU Flame Graph for dead workers.
Otherwise, users will click the link and jump to a page without information they need.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[dashboard] Logs button shows for dead nodes but doesn't work
4 participants