Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix dashboard process reporting on windows #45578

Merged
merged 4 commits into from
Jul 29, 2024
Merged

Conversation

mattip
Copy link
Contributor

@mattip mattip commented May 27, 2024

Why are these changes needed?

Fixes two problems found on windows in process dashboard reporting:

  • PR [Core] Support num_fds metric for components #39790 added a num_fds metric which is not supported on windows
  • Windows runs python via a launcher, which then calls the actual python.exe. See PEP 397. Confusingly, both launcher and the actual process it launches are named python.exe. This caused two problems:
    • When looking for the base raylet process, it is not sufficient to just check pid.parent(), so now the check climbs up the process heirarchy until it finds a process named raylet* or reaches the root.
    • When looking for the workers, it is not sufficient to check raylet.child(), so for each child the check climbs down the process heirarchy to the bottom.

Happy to add a test, or to enable a skipped windows test. Where can I find one?

With these changes, I now see data in the actor panel of the dashboard.

Related issue number

Closes #44604

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@alanwguo
Copy link
Contributor

Nice, @rickyyx @rkooo567 , can you review this one?

For test, can a test be added or updated in test_reporter.py?

Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


via GIPHY

workers = {
self._generate_worker_key(proc): proc for proc in raylet_proc.children()
}
workers = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this for windows only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. While PEP 397 does mention windows, there is a launcher for other operating systems. The problem found here is subtle and by making it windows-only may trip up users on other platforms.

@mattip
Copy link
Contributor Author

mattip commented Jun 2, 2024

Thanks for the hint. After a036832 running test_reporter.py locally works on linux. The test in test_node_physical_stats is failing on windows. I think the problem is in collecting all the child processes of raylet.exe. On windows, setproctitle cannot change the name of the executable to rayy:IDLE or ray::ACTOR, so the test_node_physical_stats fails. I skipped it.

How do I see if the test_reporter tests are running on CI for Linux and macos, and how do I enable them for windows?

Edit: setproctitile mentions the windows limitation at the end of the README

@rickyyx
Copy link
Contributor

rickyyx commented Jun 4, 2024

cc @anyscalesam - can you help shepherd here among core team?

@anyscalesam anyscalesam requested a review from a team June 11, 2024 23:14
@mattip
Copy link
Contributor Author

mattip commented Jun 19, 2024

@anyscalesam ping

@mattip
Copy link
Contributor Author

mattip commented Jun 21, 2024

It is a shame this missed the 2.30 release.

@jjyao
Copy link
Collaborator

jjyao commented Jul 9, 2024

@mattip part of the issue is fixed by #46329, could you rebase?

@mattip mattip force-pushed the issue44604 branch 2 times, most recently from 10b40c9 to 921d111 Compare July 9, 2024 13:42
@mattip
Copy link
Contributor Author

mattip commented Jul 9, 2024

@jjyao rebased and cleaned up commit history: one commit fixes the problem and the other cleans up tests and skips one that cannot succeed.

# windows, get the child process not the runner
for child in raylet_proc.children():
if child.children():
child = child.children()[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do it only for windows, otherwise if a Ray task launches a child process, this will find that process not the Ray task worker process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, that would happen on windows too. I wonder if there is some better indicator that the process is the one we are looking for. I should add a test for a worker that has subprocess.Popen() in its task.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fine since we only get child of child and we know the direct child must be a launcher?

self._raylet_proc = curr_proc.parent()
# The dashboard agent is a child of the raylet process.
# It is not necessarily the direct child (python-windows
# typically uses a py.exe runner to run python), so search
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is incomplete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Signed-off-by: mattip <[email protected]>
Signed-off-by: Matti Picus <[email protected]>
Signed-off-by: mattip <[email protected]>
Signed-off-by: Matti Picus <[email protected]>
Signed-off-by: Matti Picus <[email protected]>
@mattip
Copy link
Contributor Author

mattip commented Jul 21, 2024

Rebased to clear merge conflicts

@mattip
Copy link
Contributor Author

mattip commented Jul 22, 2024

@jjyao CI is passing

@PhilippWillms
Copy link

@jjyao : Can you merge that, please? Unfortunately, another release was missed, but then let's ensure the solution makes it to the next one.

self._generate_worker_key(proc): proc for proc in raylet_proc.children()
}
workers = {}
if sys.platform == "win32":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we split the proc finding logic vs the dict assignment? something like

            def find_worker_child_proc(proc):
                if sys.platform != "win32":
                    return proc
                # TODO: add comment about the launcher and PEP 397
                # TODO: add some checks here to make sure it's the worker ?
                if child.children():
                    # TODO: assert len(child.children()) == 1 ?
                    return child.children()[0]
                return child
            workers = {
                    self._generate_worker_key(proc): proc
                    for proc in map(find_worker_child_proc, raylet_proc.children())
            }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that can be done in a follow-on clean-up PR? It has been difficult to get reviewed and approved and I don't want to prolong the process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Dashboard] Ray 2.10: CPU / RAM / GPU usage not correctly displayed on Windows
7 participants