Worker crashes on ray.get while spilling objects #15808
Comments
Btw, do you know how to reproduce it? I assume that if we run an object spilling workload with an async actor, it should be pretty easy to trigger, but it would be great if we could have a script (so that we can have unit tests).
Ah, I didn't realize that I needed to use an async actor. Now I can reproduce it with the community version. Here is the repro script (a modified version of one of the existing spilling tests):

```python
import platform

import numpy as np
import pytest

import ray
# assert_no_thrashing is the helper used by Ray's object spilling tests, and
# the object_spilling_config / shutdown_only fixtures come from the Ray test
# suite (exact import path may differ).
from ray.tests.test_object_spilling import assert_no_thrashing


@pytest.mark.skipif(
    platform.system() == "Windows", reason="Failing on Windows.")
@pytest.mark.asyncio
async def test_spill_during_get(object_spilling_config, shutdown_only):
    object_spilling_config, _ = object_spilling_config
    address = ray.init(
        num_cpus=4,
        object_store_memory=100 * 1024 * 1024,
        _system_config={
            "automatic_object_spilling_enabled": True,
            "object_store_full_delay_ms": 100,
            "max_io_workers": 1,
            "object_spilling_config": object_spilling_config,
            "min_spilling_size": 0,
        },
    )

    @ray.remote
    class Actor:
        async def f(self):
            return np.zeros(10 * 1024 * 1024)

    a = Actor.remote()
    ids = []
    for i in range(10):
        x = a.f.remote()
        print(i, x)
        ids.append(x)

    # Concurrent gets, which require restoring from external storage, while
    # objects are being created.
    for x in ids:
        print((await x).shape)

    assert_no_thrashing(address["redis_address"])
```

Note that we need to add a sleep statement to easily reproduce the issue. I don't know how to mock this gap in a test.

```diff
diff --git a/src/ray/core_worker/core_worker.cc b/src/ray/core_worker/core_worker.cc
index dcd1135c91..657bdc3338 100644
--- a/src/ray/core_worker/core_worker.cc
+++ b/src/ray/core_worker/core_worker.cc
@@ -2832,6 +2832,7 @@ void CoreWorker::PlasmaCallback(SetResultCallback success,
   bool object_is_local = false;
   if (Contains(object_id, &object_is_local).ok() && object_is_local) {
     std::vector<std::shared_ptr<RayObject>> vec;
+    std::this_thread::sleep_for(std::chrono::seconds(10));
     RAY_CHECK_OK(Get(std::vector<ObjectID>{object_id}, 0, &vec));
     RAY_CHECK(vec.size() > 0)
         << "Failed to get local object but Raylet notified object is local.";
```
Hmm, I see. This seems to be pretty hard to test... maybe adding an async actor-based object spilling test plus manual verification is the only thing we can do.
@kfstorm could you take a look at this first?
@Catch-Bull and I are working on this.
What is the problem?
Ray version and other system information (Python version, TensorFlow version, OS): Ant internal repo synced to the community commit 6e19fac.
We found an object spilling bug in our internal code repo and I'm wondering if this has been fixed in the community. I suspect CoreWorker::PlasmaCallback may crash if spilling happens between Contains and Get.

Worker log:
Raylet log grepped by b03458cf901098282ca68956c65eedcd9c2d89d43a107fe92b000000:
Worker crashed here: src/ray/core_worker/core_worker.cc, line 2439 (at commit 6e19fac).
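For illustration, here is a minimal standalone sketch of the suspected check-then-act race. This is not Ray code: the mock store, object id, and timings are made up, and only the Contains/Get names mirror the report. A concurrent "spiller" thread evicts the object after the locality check, so the subsequent get finds nothing, which is the condition the RAY_CHECK in PlasmaCallback turns into a crash:

```cpp
// Standalone illustration of the suspected check-then-act race (not Ray code).
// A "spiller" thread removes the object between the locality check and the
// get, mirroring how spilling can invalidate the Contains() result before
// Get() runs.
#include <chrono>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>

std::mutex store_mutex;
std::set<std::string> local_store = {"obj_a"};  // mock local plasma store

bool Contains(const std::string &id) {
  std::lock_guard<std::mutex> lock(store_mutex);
  return local_store.count(id) > 0;
}

bool Get(const std::string &id) {
  std::lock_guard<std::mutex> lock(store_mutex);
  return local_store.count(id) > 0;  // fails if the object was spilled
}

int main() {
  std::thread spiller([] {
    // Simulate spilling kicking in shortly after the locality check.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    std::lock_guard<std::mutex> lock(store_mutex);
    local_store.erase("obj_a");
  });

  if (Contains("obj_a")) {
    // Window in which spilling can evict the object; the repro diff above
    // widens this window with a 10-second sleep.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    if (!Get("obj_a")) {
      // In the real callback this is where RAY_CHECK fires and the worker
      // crashes instead of falling back to fetching the spilled copy.
      std::cerr << "object was local at the check but is gone at Get()\n";
    }
  }
  spiller.join();
  return 0;
}
```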
Reproduction (REQUIRED)
Unfortunately I can't find a way to reproduce this issue.