[Object spilling] Avoid worker crash when an object is spilled right after being restored #15903

kfstorm · 2021-05-19T13:24:46Z

Why are these changes needed?

When object store memory pressure is high (for example, the object store is almost full because of pinned objects), a recently restored object may be spilled again shortly afterwards. The existing code in CoreWorker::PlasmaCallback calls Contains and then Get, and object spilling may happen between the two calls. So we can't use RAY_CHECK_OK on the Get call here. Instead, we should fall back to the object-not-local code path.
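The race can be illustrated with a minimal Python sketch. This is not Ray's code: `FakeStore`, `plasma_callback`, and the `between_calls` hook are hypothetical names used only to model a Contains-then-Get sequence in which a spill lands between the two calls. The point is that a miss on the second call is treated as "object not local" rather than as a fatal check failure.

```python
class FakeStore:
    """Hypothetical in-memory stand-in for the object store (not Ray's API)."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id, value):
        self._objects[object_id] = value

    def contains(self, object_id):
        return object_id in self._objects

    def get(self, object_id):
        # Returns None when the object is no longer local (e.g. it was spilled).
        return self._objects.get(object_id)

    def spill(self, object_id):
        # Models the object being moved out of the store under memory pressure.
        self._objects.pop(object_id, None)


def plasma_callback(store, object_id, on_local, on_not_local, between_calls=None):
    """Check-then-get with a fallback instead of asserting that get succeeds."""
    if store.contains(object_id):
        if between_calls is not None:
            between_calls()  # assumption: simulates a spill racing in here
        value = store.get(object_id)
        if value is not None:
            on_local(value)
            return
    # Either the object was never local, or it was spilled after contains().
    on_not_local(object_id)


def run(spill_in_between):
    """Runs one callback and returns the list of events that were recorded."""
    store = FakeStore()
    store.put("obj", b"data")
    events = []
    hook = (lambda: store.spill("obj")) if spill_in_between else None
    plasma_callback(
        store,
        "obj",
        on_local=lambda value: events.append(("local", value)),
        on_not_local=lambda oid: events.append(("not_local", oid)),
        between_calls=hook,
    )
    return events
```

Calling `run(True)` exercises the racy path and `run(False)` the normal one; in both cases the callback returns instead of crashing.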

Related issue number

Closes #15808

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

rkooo567 · 2021-05-19T16:43:37Z

Can you verify manually that this works with some custom code, btw? Maybe we can add a sleep that takes 3 seconds -> 2 seconds -> 0 seconds and see if this works properly.

rkooo567 · 2021-05-19T16:43:49Z

The code itself LGTM

kfstorm · 2021-05-20T07:41:38Z

@rkooo567 Manual test results:

w/o the fix:
sleep 3s: very likely to crash
sleep 2s: very likely to crash
sleep 1s: very likely to crash
sleep 0s: no crash

w/ the fix:
sleep 3s: no crash
sleep 2s: no crash
sleep 1s: no crash
sleep 0s: no crash

rkooo567 · 2021-05-20T08:38:04Z

@kfstorm just to make sure: did the shuffle eventually finish? ("No crash" means the work kept making progress and completed, am I correct?) We can test this by decreasing the sleep time for each call (3 -> 2 -> 1 -> 0 seconds).

kfstorm · 2021-05-20T08:47:37Z

@rkooo567 Oh, the manual test I mentioned above is to run the Python UT. We didn't test sleeping in our cluster. What I'm sure of is that our job finished successfully after the fix.

We logged CoreWorker::MemoryUsageString in our cluster, and this is a typical log when the if condition in PlasmaCallback is not satisfied:

num clients with quota: 0
quota map size: 0
pinned quota map size: 0
allocated bytes: 1200002042
allocation limit: 1579372544
pinned bytes: 3600004861
(global lru) capacity: 1579372544
(global lru) used: 0%
(global lru) num objects: 0
(global lru) num evictions: 1790
(global lru) bytes evicted: 158000458336

I guess (global lru) num objects: 0 means that all objects in plasma store are pinned, hence not in LRU, right?
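To make dumps like the one above easier to compare across runs, here is a small sketch that parses the text into a dict. `parse_memory_usage` is a hypothetical helper, not part of Ray; it only assumes the `name: value` line format shown in the log.

```python
def parse_memory_usage(text):
    """Parses 'name: value' lines into a dict (hypothetical helper, not Ray's API)."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so names such as "(global lru) used" stay intact.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue  # not a "name: value" line
        name, value = name.strip(), value.strip()
        if value.endswith("%"):
            stats[name] = float(value[:-1]) / 100.0  # percentage as a fraction
        elif value.isdigit():
            stats[name] = int(value)
    return stats


# The sample is copied from the log in this comment.
SAMPLE = """\
num clients with quota: 0
quota map size: 0
pinned quota map size: 0
allocated bytes: 1200002042
allocation limit: 1579372544
pinned bytes: 3600004861
(global lru) capacity: 1579372544
(global lru) used: 0%
(global lru) num objects: 0
(global lru) num evictions: 1790
(global lru) bytes evicted: 158000458336
"""

stats = parse_memory_usage(SAMPLE)
lru_is_empty = stats["(global lru) num objects"] == 0
```

Here `lru_is_empty` only reports that the LRU holds no objects; whether that implies every object is pinned is the open question above.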

kfstorm · 2021-05-20T08:49:04Z

PS: We've tested our Mars job in our cluster w/ and w/o the fix several times and verified that the fix is functional.

rkooo567 · 2021-05-20T18:46:15Z

LGTM!

rkooo567 · 2021-05-21T01:36:08Z

The flaky test failure seems to be unrelated.

ericl · 2021-05-21T04:15:18Z

I'm not certain but I think this PR is causing OSX test_object_spilling failures.

ericl · 2021-05-21T04:21:51Z

It's odd, since the PR build above looks ok for OSX (though it did flake). But I can't find any other proximate PR besides maybe a1375a9

kfstorm · 2021-05-21T07:44:51Z

I cannot reproduce the failure on my mac with 061e3fb.

rkooo567 · 2021-05-21T17:08:12Z

Yeah, it passed all tests before I merged it (same for a1375a9). The failure is probably caused by some other factor. (The test failure is a timeout, so I guess we should split some of the cases into other tests?)

rkooo567 · 2021-05-21T17:08:22Z

cc @franklsf95

ericl · 2021-05-21T18:40:47Z

Hmm, it seems reverting this didn't help. Either I reverted the wrong PR, or there was some other environment change.

ericl · 2021-05-21T18:40:56Z

(We can probably put it back)

ericl · 2021-05-21T18:41:01Z

(We can probably put it back)

rkooo567 · 2021-05-21T19:56:04Z

@kfstorm can you create a PR and assign me there?

kfstorm · 2021-05-24T04:19:26Z

@rkooo567 #16012

kfstorm added 2 commits May 19, 2021 20:49

Fix check failure when memory pressure is high

c4af649

Add test

3cebaf7

kfstorm requested review from rkooo567, simon-mo and ericl May 19, 2021 13:25

simon-mo requested a review from stephanie-wang May 19, 2021 16:24

rkooo567 assigned stephanie-wang and rkooo567 May 19, 2021

lint

0315438

rkooo567 approved these changes May 20, 2021

kfstorm changed the title ~~Avoid worker crash when an object is spilled right after being restored~~ [Object spilling] Avoid worker crash when an object is spilled right after being restored May 20, 2021

rkooo567 merged commit 061e3fb into ray-project:master May 21, 2021

ericl added a commit to ericl/ray that referenced this pull request May 21, 2021

Revert "[Object spilling] Avoid worker crash when an object is spille…

dbd5800

…d right after being restored (ray-project#15903)" This reverts commit 061e3fb.

ericl mentioned this pull request May 21, 2021

Revert "[Object spilling] Avoid worker crash when an object is spille… #15964

Merged

6 tasks

kfstorm deleted the fix_crash_in_plasma_callback branch May 21, 2021 06:42
