[core][release] chaos_dataset_shuffle_push_based_sort_1tb failed with WorkerCrashedError #28774

Closed
rickyyx opened this issue Sep 26, 2022 · 3 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P1 (Issue that should be fixed within a few weeks)

Comments

@rickyyx
Contributor

rickyyx commented Sep 26, 2022

What happened + What you expected to happen

chaos_dataset_shuffle_push_based_sort_1tb

(raylet, ip=172.31.66.10) [2022-09-22 16:02:34,214 E 138 138] (raylet) node_manager.cc:2966: System memory low at node with IP 172.31.66.10. Used memory (54.90GB) / total capacity (57.60GB) (0.95307) exceeds threshold 0.95, killing latest task with name  and task ID NIL_ID to avoid running out of memory.
(raylet, ip=172.31.66.10) This may indicate a memory leak in a task or actor, or that too many tasks are running in parallel.
(raylet, ip=172.31.66.10) To find the highest memory consumers, use `ray logs raylet.out -ip 172.31.66.10`.
(raylet, ip=172.31.66.10) Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the eviction threshold, set the environment variable `RAY_memory_usage_threshold_fraction` when starting Ray. To disable worker eviction, set the environment variable `RAY_memory_monitor_interval_ms` to zero.

....


ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
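For triage, the numbers in the "System memory low" line above can be pulled out programmatically. The following is a minimal sketch, assuming the log line keeps the layout shown in this one sample; the regex and the helper name `parse_memory_low` are hypothetical and not part of Ray.

```python
import re

# Pattern inferred from the single raylet log line quoted in this issue.
# The layout is an assumption and may differ between Ray versions.
MEMORY_LOW = re.compile(
    r"Used memory \((?P<used_gb>\d+(?:\.\d+)?)GB\) / "
    r"total capacity \((?P<total_gb>\d+(?:\.\d+)?)GB\) "
    r"\((?P<fraction>\d+(?:\.\d+)?)\) "
    r"exceeds threshold (?P<threshold>\d+(?:\.\d+)?)"
)


def parse_memory_low(line):
    """Return the reported memory figures as floats, or None if no match."""
    match = MEMORY_LOW.search(line)
    if match is None:
        return None
    info = {name: float(value) for name, value in match.groupdict().items()}
    # Memory still free on the node when the monitor decided to kill a task.
    info["headroom_gb"] = info["total_gb"] - info["used_gb"]
    return info


# Excerpt of the log line from this issue.
SAMPLE = (
    "System memory low at node with IP 172.31.66.10. "
    "Used memory (54.90GB) / total capacity (57.60GB) (0.95307) "
    "exceeds threshold 0.95"
)

info = parse_memory_low(SAMPLE)
```

Here the reported fraction is only slightly over the 0.95 threshold, which suggests the node was running close to the kill limit rather than far past it.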

Versions / Dependencies

Last success: 45d7cd2
First Failure: fb7472f

fb7472f Remove RAY_RAYLET_NODE_ID (#28715)
93f911e Add API latency and call counts metrics to dashboard APIs (#28279)
66aae4c [Release Test] Make sure to delete all EBS volumes (#28707)
697df80 [Serve] [Docs] Remove incorrect output (#28708)
d8c9aa7 [docs] configurable ecosystem gallery (#28662)
42874e1 [RLlib] Atari gym environments now require ale-py. (#28703)
b7f0346 [AIR] Maintain dtype info in LightGBMPredictor (#28673)
f6ae7ee [tune] Test background syncer serialization (#28699)
87f22e1 [ci] Requirements contains duplicate of 'starlette' (#28698)
2e7040e [Tune] [PBT] Maintain consistent Trial/TrialRunner state when pausing and resuming trial (#28511)
6530635 [ci] Fix mac pipeline (use python 2 in CI scripts) (#28695)
ee2a8da [ci] Move to new hierarchical docker structure + pipeline (#28641)
a3c97b4 [Doc] Revamp ray core design patterns doc [8/n]: pass large arg by value (#28660)
db2ce69 [Datasets] Add initial aggregate benchmark (#28486)
9c2abf9 [KubeRay][Operator] Improve migration notes (#28672)
45d7cd2 [core] Support generators to allow tasks to return a dynamic number of objects (#28291)
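The list above runs from the first failing commit down to the last passing one, so every entry except the last is a candidate for the regression. A minimal sketch of that selection, assuming the lines are ordered newest first as shown; `suspect_commits` is a hypothetical helper, and the sample is an abbreviated subset of the list above.

```python
def suspect_commits(log_lines, last_good, first_bad):
    """Return the lines from first_bad (inclusive) down to last_good (exclusive).

    log_lines must be ordered newest first, each formatted as "<hash> <subject>".
    """
    hashes = [line.split(maxsplit=1)[0] for line in log_lines]
    start = hashes.index(first_bad)   # newest commit known to fail
    end = hashes.index(last_good)     # newest commit known to pass
    return log_lines[start:end]


# Abbreviated subset of the commit list in this issue (newest first).
LOG = [
    "fb7472f Remove RAY_RAYLET_NODE_ID (#28715)",
    "93f911e Add API latency and call counts metrics to dashboard APIs (#28279)",
    "9c2abf9 [KubeRay][Operator] Improve migration notes (#28672)",
    "45d7cd2 [core] Support generators to allow tasks to return a dynamic number of objects (#28291)",
]

suspects = suspect_commits(LOG, last_good="45d7cd2", first_bad="fb7472f")
```

The last passing commit is excluded from the result because the test succeeded with it applied.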

Reproduction script

NA

Issue Severity

No response

rickyyx added the bug, P0, triage, and core labels Sep 26, 2022
rickyyx added this to the Core Nightly/CI Regressions milestone Sep 26, 2022
@rickyyx
Contributor Author

rickyyx commented Sep 26, 2022

There are a bunch of open issues, but it seems to me the root causes might be different:

@jjyao
Collaborator

jjyao commented Oct 14, 2022

The last 10 runs all succeeded without this error.

jjyao added the P1 label and removed the P0 and triage labels Oct 14, 2022
@hora-anyscale
Contributor

Per Triage Sync: Closing, test flaky
