[Core] node crash due to agent failure #37731

Closed
scv119 opened this issue Jul 24, 2023 · 1 comment · Fixed by #37953
Labels: bug, core, P0

Comments

scv119 commented Jul 24, 2023

What happened + What you expected to happen

We noticed that nodes start failing due to an agent failure like the following:

2023-07-21 02:20:52,503	ERROR reporter_agent.py:1112 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1653, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_common.py", line 480, in wrapper
    raise raise_from(err, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_common.py", line 478, in wrapper
    return fun(self)
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1695, in _parse_stat_file
    data = bcat("%s/%s/stat" % (self._procfs_path, self.pid))
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_common.py", line 813, in bcat
    return cat(fname, fallback=fallback, _open=open_binary)
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_common.py", line 801, in cat
    with _open(fname) as f:
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_common.py", line 765, in open_binary
    return open(fname, "rb", buffering=FILE_READ_BUFFER_SIZE)
FileNotFoundError: [Errno 2] No such file or directory: '/proc/1214155/stat'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1095, in _perform_iteration
    stats = self._get_all_stats()
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 626, in _get_all_stats
    "workers": self._get_workers(),
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 512, in _get_workers
    return [
  File "/usr/local/lib/python3.8/dist-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 525, in <listcomp>
    if w.status() != psutil.STATUS_ZOMBIE
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/__init__.py", line 691, in status
    return self._proc.status()
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1653, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 2187, in status
    letter = self._parse_stat_file()['status']
  File "/usr/local/lib/python3.8/dist-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1660, in wrapper
    raise NoSuchProcess(self.pid, self._name)
psutil.NoSuchProcess: process no longer exists (pid=1214155)
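
The traceback shows `w.status()` raising `psutil.NoSuchProcess` inside the list comprehension in `_get_workers`: the worker process exited between being listed and being inspected, so `/proc/<pid>/stat` was already gone. A minimal sketch of one way to tolerate that race is below. This is an illustration only, not necessarily the change made in #37953; the helper name `live_non_zombie` is hypothetical and does not exist in Ray.

```python
import psutil


def live_non_zombie(procs):
    """Return the processes that still exist and are not zombies.

    Hypothetical helper for illustration; not part of Ray or psutil.
    """
    alive = []
    for proc in procs:
        try:
            # status() reads /proc/<pid>/stat on Linux and can raise if the
            # process exited after it was listed.
            if proc.status() != psutil.STATUS_ZOMBIE:
                alive.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # The process is gone or unreadable; skip it instead of
            # letting the exception abort the whole stats collection.
            continue
    return alive
```

For example, `live_non_zombie(psutil.Process().children())` would filter the current process's children without failing if one of them exits mid-iteration.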

Versions / Dependencies

latest

Reproduction script

n/a

Issue Severity

None

scv119 added the bug, P0, and triage labels Jul 24, 2023
scv119 self-assigned this Jul 24, 2023
scv119 commented Jul 24, 2023

scv119 added the core label and removed the triage label Jul 24, 2023