Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Object spilling] Avoid worker crash when an object is spilled right after being restored #15903

Merged
merged 3 commits into from
May 21, 2021

Conversation

kfstorm
Copy link
Member

@kfstorm kfstorm commented May 19, 2021

Why are these changes needed?

When the object store memory pressure is high, e.g. object store is almost full due to pinned objects, an object which is recently restored may be spilled again in a short time. The existing code in CoreWorker::PlasmaCallback involves Contains and Get calls. Object spilling may happen between the two calls. So here we can't use RAY_CHECK_OK on the Get call. Instead, we should fall back to the object-not-local code path.

Related issue number

Closes #15808

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567
Copy link
Contributor

Can you verify this works with some custom code manually btw? maybe we can add sleep that takes like 3 seconds -> 2 seconds -> 0 second and see if this works properly.

@rkooo567
Copy link
Contributor

The code itself LGTM

@kfstorm
Copy link
Member Author

kfstorm commented May 20, 2021

@rkooo567 Manual test results:

w/o the fix:
sleep 3s: very likely to crash
sleep 2s: very likely to crash
sleep 1s: very likely to crash
sleep 0s: no crash

w/ the fix:
sleep 3s: no crash
sleep 2s: no crash
sleep 1s: no crash
sleep 0s: no crash

@rkooo567
Copy link
Contributor

@kfstorm just to make sure - did the shuffle eventually finish right? (won't crash means the work was still in progress and it was done, am I correct)? We can test this by decreasing the sleep time for each call (3 -> 2 -> 1 -> 0 second)

@kfstorm
Copy link
Member Author

kfstorm commented May 20, 2021

@rkooo567 Oh, the manual test I mentioned above is to run the Python UT. We didn't test sleeping in our cluster. What I'm sure of is that our job finished successfully after the fix.

We logged CoreWorker::MemoryUsageString in our cluster and this is a typical log when the if condition in PlasmaCallback cannot pass:

num clients with quota: 0
quota map size: 0
pinned quota map size: 0
allocated bytes: 1200002042
allocation limit: 1579372544
pinned bytes: 3600004861
(global lru) capacity: 1579372544
(global lru) used: 0%
(global lru) num objects: 0
(global lru) num evictions: 1790
(global lru) bytes evicted: 158000458336

I guess (global lru) num objects: 0 means that all objects in plasma store are pinned, hence not in LRU, right?

@kfstorm
Copy link
Member Author

kfstorm commented May 20, 2021

PS: We've tested our Mars job in our cluster w/ and w/o the fix several times and verified that the fix is functional.

@kfstorm kfstorm changed the title Avoid worker crash when an object is spilled right after being restored [Object spilling] Avoid worker crash when an object is spilled right after being restored May 20, 2021
@rkooo567
Copy link
Contributor

LGTM!

@rkooo567
Copy link
Contributor

The flaky test failure seems to be unrelated.

@rkooo567 rkooo567 merged commit 061e3fb into ray-project:master May 21, 2021
@ericl
Copy link
Contributor

ericl commented May 21, 2021

image

I'm not certain but I think this PR is causing OSX test_object_spilling failures.

ericl added a commit to ericl/ray that referenced this pull request May 21, 2021
@ericl
Copy link
Contributor

ericl commented May 21, 2021

It's odd, since the PR build above looks ok for OSX (though it did flake). But I can't find any other proximate PR besides maybe a1375a9

@kfstorm kfstorm deleted the fix_crash_in_plasma_callback branch May 21, 2021 06:42
@kfstorm
Copy link
Member Author

kfstorm commented May 21, 2021

I cannot reproduce the failure on my mac with 061e3fb.

@rkooo567
Copy link
Contributor

Yeah it passed all tests before I merged it. (same for a1375a9). It is probably from some other factors (and the test failure is timeout, so I guess we should split some of them to other tests?)

@rkooo567
Copy link
Contributor

cc @franklsf95

@ericl
Copy link
Contributor

ericl commented May 21, 2021

Hmm, it seems reverting this didn't help. It could be I reverted the wrong PR, or there was some other environment change.

@ericl
Copy link
Contributor

ericl commented May 21, 2021

(We can probably put it back)

@ericl
Copy link
Contributor

ericl commented May 21, 2021

(We can probably put it back)

@rkooo567
Copy link
Contributor

@kfstorm can you create a PR and assign me there?

@kfstorm
Copy link
Member Author

kfstorm commented May 24, 2021

@rkooo567 #16012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Worker crashes on ray.get while spilling objects
4 participants