-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Object spilling] Avoid worker crash when an object is spilled right after being restored #15903
[Object spilling] Avoid worker crash when an object is spilled right after being restored #15903
Conversation
Can you verify this works with some custom code manually btw? maybe we can add sleep that takes like 3 seconds -> 2 seconds -> 0 second and see if this works properly. |
The code itself LGTM |
@rkooo567 Manual test results: w/o the fix: w/ the fix: |
@kfstorm just to make sure - did the shuffle eventually finish right? (won't crash means the work was still in progress and it was done, am I correct)? We can test this by decreasing the sleep time for each call (3 -> 2 -> 1 -> 0 second) |
@rkooo567 Oh, the manual test I mentioned above is to run the Python UT. We didn't test sleeping in our cluster. What I'm sure of is that our job finished successfully after the fix. We logged
I guess |
PS: We've tested our Mars job in our cluster w/ and w/o the fix several times and verified that the fix is functional. |
LGTM! |
The flaky test failure seems to be unrelated. |
…d right after being restored (ray-project#15903)" This reverts commit 061e3fb.
It's odd, since the PR build above looks ok for OSX (though it did flake). But I can't find any other proximate PR besides maybe a1375a9 |
I cannot reproduce the failure on my mac with 061e3fb. |
Yeah it passed all tests before I merged it. (same for a1375a9). It is probably from some other factors (and the test failure is timeout, so I guess we should split some of them to other tests?) |
cc @franklsf95 |
Hmm, it seems reverting this didn't help. It could be I reverted the wrong PR, or there was some other environment change. |
(We can probably put it back) |
(We can probably put it back) |
@kfstorm can you create a PR and assign me there? |
Why are these changes needed?
When the object store memory pressure is high, e.g. object store is almost full due to pinned objects, an object which is recently restored may be spilled again in a short time. The existing code in
CoreWorker::PlasmaCallback
involvesContains
andGet
calls. Object spilling may happen between the two calls. So here we can't useRAY_CHECK_OK
on theGet
call. Instead, we should fall back to the object-not-local code path.Related issue number
Closes #15808
Checks
scripts/format.sh
to lint the changes in this PR.