
[data] release test failure : pipelined_ingestion_1500_gb #33846

Closed
clarng opened this issue Mar 29, 2023 · 9 comments · Fixed by #34030
Labels: bug (Something that is supposed to be working, but isn't), P0 (Issues that should be fixed in short order), Ray 2.4, release-blocker (P0 Issue that blocks the release)

Comments

clarng (Contributor) commented Mar 29, 2023

What happened + What you expected to happen

Looks like it timed out:

[ERROR 2023-03-28 19:02:48,645] run_release_test.py: 164 Command timed out after 9600.080104424998 seconds.

Traceback (most recent call last):
  File "ray_release/scripts/run_release_test.py", line 160, in main
    no_terminate=no_terminate,
  File "/tmp/release-a8qe7WenMw/release/ray_release/glue.py", line 488, in run_release_test
    raise pipeline_exception
  File "/tmp/release-a8qe7WenMw/release/ray_release/glue.py", line 372, in run_release_test
    raise e
  File "/tmp/release-a8qe7WenMw/release/ray_release/glue.py", line 364, in run_release_test
    raise_on_timeout=not is_long_running,
  File "/tmp/release-a8qe7WenMw/release/ray_release/command_runner/anyscale_job_runner.py", line 269, in run_command
    job_status_code, error, raise_on_timeout=raise_on_timeout
  File "/tmp/release-a8qe7WenMw/release/ray_release/command_runner/anyscale_job_runner.py", line 174, in _handle_command_output
    f"Command timed out after {workload_time_taken} seconds."
ray_release.exception.TestCommandTimeout: Command timed out after 9600.080104424998 seconds.

Versions / Dependencies

master

Reproduction script

https://buildkite.com/ray-project/release-tests-branch/builds/1493#01872a7e-b392-4a66-98ea-21091dc3636f

Issue Severity

None

@clarng added the bug and triage labels on Mar 29, 2023
@clarng added the release-blocker, P0, and Ray 2.4 labels on Mar 29, 2023
@scottjlee removed the triage label on Mar 31, 2023
jianoaix (Contributor) commented Apr 3, 2023

I was able to reproduce it at full scale (915 files) of this test: https://console.anyscale-staging.com/o/anyscale-internal/workspaces/expwrk_c29icr8mng8ts8u1d2dagg8tt6/ses_kf2dra2s6xzdti2in5svlx3ldp?command-history-section=command_history

I think the issue is likely that the consumer nodes died:

======== List: 2023-04-03 14:32:53.396982 ========
Stats:
------------------------------
Total: 27

Table:
------------------------------
    NODE_ID                                                   NODE_IP      STATE    NODE_NAME    RESOURCES_TOTAL
 0  0dafd86422abf30258b2a7b31af7e2dda0c642d1b2a275a619e19cda  10.0.13.195  DEAD     10.0.13.195  GPU: 4.0
                                                                                                 memory: 164925865575.0
                                                                                                 node:10.0.13.195: 1.0
                                                                                                 object_store_memory: 70682513817.0
 1  15e4552fc25c0bd085bd5cf76f67d368bcd6d986dd2d1062120265e5  10.0.4.207   ALIVE    10.0.4.207   GPU: 4.0
                                                                                                 memory: 164925498573.0
                                                                                                 node:10.0.4.207: 1.0
                                                                                                 object_store_memory: 70682356531.0
 2  2775fb7aa0957a68aad17fcfa0c1fda27d54f655fd06dbf49178a90c  10.0.9.167   ALIVE    10.0.9.167   CPU: 32.0
                                                                                                 memory: 164925779559.0
                                                                                                 node:10.0.9.167: 1.0
                                                                                                 object_store_memory: 70682476953.0
 3  2bcafd1a70d8d63870764da9459d6737e59409133c992b57821ee194  10.0.46.7    ALIVE    10.0.46.7    GPU: 4.0
                                                                                                 memory: 164925808231.0
                                                                                                 node:10.0.46.7: 1.0
                                                                                                 object_store_memory: 70682489241.0
 4  3338bea7149d3b7608a85c1ee4060da8b23ff401702bbb561ac9a895  10.0.37.84   DEAD     10.0.37.84   GPU: 4.0
                                                                                                 memory: 164925871309.0
                                                                                                 node:10.0.37.84: 1.0
                                                                                                 object_store_memory: 70682516275.0
 5  3863f2671e973c27da933a558553209934883c97b37cc1a938a2a47d  10.0.25.230  ALIVE    10.0.25.230  CPU: 32.0
                                                                                                 memory: 164925963060.0
                                                                                                 node:10.0.25.230: 1.0
                                                                                                 object_store_memory: 70682555596.0
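
For reference, a listing like the one above can be produced with the Ray state API; a minimal sketch, assuming the `ray.experimental.state.api` module available around this Ray version (the CLI equivalent is `ray list nodes`):

```python
# Sketch: enumerate cluster nodes and their state, as in the table above.
import ray
from ray.experimental.state.api import list_nodes

ray.init(address="auto")  # attach to the running cluster
for node in list_nodes():
    # Each entry carries node_id, node_ip, state (ALIVE/DEAD), and
    # resources_total, matching the columns shown above.
    print(node)
```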

Taking 10.0.13.195 as an example, it died because of OOM:

worker.py:1983 -- Raylet is terminated: ip=10.0.13.195, id=0dafd86422abf30258b2a7b31af7e2dda0c642d1b2a275a619e19cda. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [2023-04-03 14:16:54,777 C 166 166] (raylet) node_manager.cc:2151:  Check failed: worker
    *** StackTrace Information ***
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x52b5fa) [0x55de5e49f5fa] ray::operator<<()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x52cfe2) [0x55de5e4a0fe2] ray::SpdLogMessage::Flush()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x52d2f7) [0x55de5e4a12f7] ray::RayLog::~RayLog()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x28df6e) [0x55de5e201f6e] ray::raylet::NodeManager::AsyncResolveObjects()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x2906bf) [0x55de5e2046bf] ray::raylet::NodeManager::ProcessFetchOrReconstructMessage()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x29f2a2) [0x55de5e2132a2] ray::raylet::NodeManager::ProcessClientMessage()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x1e9871) [0x55de5e15d871] std::_Function_handler<>::_M_invoke()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4c20ed) [0x55de5e4360ed] ray::ClientConnection::ProcessMessage()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x50a596) [0x55de5e47e596] EventTracker::RecordExecution()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4b9442) [0x55de5e42d442] boost::asio::detail::binder2<>::operator()()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x4b9b98) [0x55de5e42db98] boost::asio::detail::reactive_socket_recv_op<>::do_complete()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa89ebb) [0x55de5e9fdebb] boost::asio::detail::scheduler::do_run_one()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa8c449) [0x55de5ea00449] boost::asio::detail::scheduler::run()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0xa8c902) [0x55de5ea00902] boost::asio::io_context::run()
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x16675a) [0x55de5e0da75a] main
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7efeffcaa083] __libc_start_main
    /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet(+0x1adb57) [0x55de5e121b57]

Ray was able to bring the consumers back, but they were then out of sync with the rest of the consumers, so the DatasetPipeline got stuck and eventually timed out.
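
For context, a minimal sketch of the consumption pattern at play here (a toy stand-in, not the actual release-test script; the class name, dataset, and sizes are illustrative):

```python
# Toy stand-in for the test's consumption pattern: one actor per shard.
import ray

@ray.remote
class Consumer:
    def run(self, shard):
        # All shards advance through pipeline windows in lockstep, so a
        # consumer that restarts out of step stalls every other shard.
        for batch in shard.iter_batches(batch_size=4096):
            pass  # a training step would consume the batch here

pipe = ray.data.range(100_000).window(blocks_per_window=20).repeat(2)
shards = pipe.split(4, equal=True)
consumers = [Consumer.remote() for _ in range(4)]
ray.get([c.run.remote(s) for c, s in zip(consumers, shards)])
```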

jianoaix (Contributor) commented Apr 3, 2023

The two nodes running consumers were using a high amount of heap memory (the object store was quite empty):
[Screenshot: per-node memory usage (Screen Shot 2023-04-03 at 2.42.03 PM)]

jianoaix (Contributor) commented Apr 3, 2023

Reproduced in bulk mode as well, which also showed high memory usage and a node getting killed.

jianoaix (Contributor) commented Apr 3, 2023

One hypothesis is that the new dataset iterator doesn't do eager object GC.
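
If that hypothesis holds, the difference would look roughly like the following hypothetical illustration (not Ray's actual iterator code): an iterator that drops each block reference as soon as it is consumed lets the object be reclaimed immediately, while one that keeps all references alive pins every consumed block until the epoch ends.

```python
# Hypothetical shape of the suspected difference; block_refs is a list
# of ObjectRefs pointing at dataset blocks.
import ray

def iter_blocks_eager(block_refs):
    # Pop the reference before fetching so nothing pins the object once
    # the caller moves on; the block can be GC'd right away.
    while block_refs:
        ref = block_refs.pop(0)
        yield ray.get(ref)

def iter_blocks_non_eager(block_refs):
    # The input list keeps every reference alive, so all consumed blocks
    # stay pinned until the whole iteration finishes.
    for ref in block_refs:
        yield ray.get(ref)
```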

jianoaix (Contributor) commented Apr 3, 2023

Verified that when using DatasetPipeline.iter_batches(), the consumer nodes' memory usage is much lighter and healthy:

[Screenshot: per-node memory usage (Screen Shot 2023-04-03 at 4.01.24 PM)]
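
As a toy stand-in for that verified path (the dataset and sizes below are illustrative, not the real 1500 GB job):

```python
# Consume the pipeline via DatasetPipeline.iter_batches() directly,
# the code path that stayed memory-healthy in the check above.
import ray

pipe = ray.data.range(10_000).window(blocks_per_window=10).repeat(2)
for batch in pipe.iter_batches(batch_size=1024):
    pass  # blocks are released promptly on this path
```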

jianoaix (Contributor) commented Apr 4, 2023

This fix had a successful run: https://buildkite.com/ray-project/release-tests-pr/builds/33701

clarng (Contributor, Author) commented Apr 6, 2023

@jianoaix I see this is closed; can we open a cherry-pick since this is a release blocker?

jianoaix (Contributor) commented Apr 6, 2023

@clarng Yep, ptal for this PR: #34141

clarng (Contributor, Author) commented Apr 7, 2023

Thanks!
