What happened + What you expected to happen
If a generator creates 100 object refs, the objects are then lost, and the generator is rerun but yields only 50 objects, a caller waiting on the latter 50 objects hangs until it fails with ObjectFetchTimedOutError.
E ray.exceptions.RayTaskError(ObjectFetchTimedOutError): ray::consumes() (pid=87331, ip=10.0.0.180)
E File "/Users/ruiyangwang/gits/ray/python/ray/tests/test_data_chaos.py", line 185, in consumes
E nums = ray.get(objs)
E ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 16310a0f0a45af5cffffffffffffffffffffffff0100000034000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
E
E Fetch for object 16310a0f0a45af5cffffffffffffffffffffffff0100000034000000 timed out because no locations were found for the object. This may indicate a system-level bug.
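The failure mode can be modeled without Ray. The following is a minimal sketch, not Ray's implementation: `ToyObjectStore`, `run_generator`, and the `"gen:<index>"` ref naming are hypothetical names introduced only for illustration. It assumes that a re-executed generator repopulates only the refs it actually yields again, so refs from the first run whose index is past the new yield count are left with no location.

```python
from typing import Any, Callable, Dict, Iterable, List


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store (not Ray's API)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, ref: str, value: Any) -> None:
        self._data[ref] = value

    def lose_all(self) -> None:
        # Models the worker node dying: callers still hold refs, data is gone.
        self._data.clear()

    def has(self, ref: str) -> bool:
        return ref in self._data


def run_generator(
    store: ToyObjectStore, task_id: str, gen_fn: Callable[[], Iterable[Any]]
) -> List[str]:
    """Run gen_fn and store each yielded value under a ref derived from its index."""
    refs = []
    for index, value in enumerate(gen_fn()):
        ref = f"{task_id}:{index}"  # assumed naming: same task + index -> same ref
        store.put(ref, value)
        refs.append(ref)
    return refs


store = ToyObjectStore()
num_to_yield = {"value": 100}  # plays the role of ValueHolder in the repro


def generates() -> Iterable[int]:
    return range(num_to_yield["value"])


# First run: 100 refs are handed to the caller.
refs = run_generator(store, "gen", generates)

# The data is lost, and the generator now yields only 50 values.
store.lose_all()
num_to_yield["value"] = 50

# Re-execution (standing in for lineage reconstruction) restores indices 0..49 only.
run_generator(store, "gen", generates)

# Refs the caller still holds but that no re-execution will ever recreate.
missing = [ref for ref in refs if not store.has(ref)]
print(f"{len(missing)} refs have no location, starting at {missing[0]}")
```

In this toy model the caller's wait on `missing` could never complete, which mirrors the "no locations were found for the object" message in the traceback above.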
Versions / Dependencies
master
Reproduction script
import os
import sys

import numpy as np
import pytest

import ray


@pytest.fixture
def short_timeout(monkeypatch):
    monkeypatch.setenv("RAY_fetch_fail_timeout_milliseconds", "1000")
    yield


def test_f(short_timeout, ray_start_cluster):
    """
    Tests nondeterministic generators vs lineage reconstruction.
    Timeline:
    1. In worker node, creates a generator that generates 100 objects
    2. Kills worker node, objs exist in ref, but data lost
    3. In worker node, creates a consumer that consumes 100 objects
    4. Start a worker node to enable the task and lineage reconstruction
    5. Lineage reconstruction should be working here.
       Make the gen to only generate 50.
    6. Verify that the consumer task can still run (it's not)
    """
    cluster = ray_start_cluster
    cluster.add_node(num_cpus=1, resources={"head": 1})
    cluster.wait_for_nodes()
    ray.init(address=cluster.address)

    @ray.remote(num_cpus=0, resources={"head": 0.1})
    class ValueHolder:
        def __init__(self, val):
            self.value = val

        def set(self, val):
            self.value = val

        def get(self):
            return self.value

    @ray.remote(num_cpus=1, resources={"worker": 1})
    def generates(value_holder):
        num = ray.get(value_holder.get.remote())
        print(f"generates {num}")
        for i in range(num):
            print(f"generating {i}")
            yield np.ones((1000, 1000), dtype=np.uint8) * i
        print(f"generated {num}")

    @ray.remote(num_cpus=1, resources={"worker": 1})
    def consumes(objs, expected_num):
        nums = ray.get(objs)  # Time out now!!!
        # E ray.exceptions.RayTaskError(ObjectFetchTimedOutError): ray::consumes() (pid=87331, ip=10.0.0.180)
        # E File "/Users/ruiyangwang/gits/ray/python/ray/tests/test_data_chaos.py", line 185, in consumes
        # E nums = ray.get(objs)
        # E ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 16310a0f0a45af5cffffffffffffffffffffffff0100000034000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
        # E
        # E Fetch for object 16310a0f0a45af5cffffffffffffffffffffffff0100000034000000 timed out because no locations were found for the object. This may indicate a system-level bug.
        assert len(nums) == expected_num
        print(f"consumes {len(nums)}")
        print(nums)
        return expected_num

    worker_node = cluster.add_node(num_cpus=10, resources={"worker": 10})
    cluster.wait_for_nodes()

    holder = ValueHolder.remote(100)
    gen = ray.get(generates.remote(holder))
    objs = list(gen)
    assert len(objs) == 100

    # kill the worker node
    cluster.remove_node(worker_node, allow_graceful=False)
    # Make sure gen only generates 50 now...
    ray.get(holder.set.remote(50))
    # ... but a consumer takes all 100
    consumer = consumes.remote(objs, 100)
    # start a new worker node
    worker_node = cluster.add_node(num_cpus=10, resources={"worker": 10})
    cluster.wait_for_nodes()

    ray.get(consumer)


if __name__ == "__main__":
    import pytest

    if os.environ.get("PARALLEL_CI"):
        sys.exit(pytest.main(["-n", "auto", "--boxed", "-vs", __file__]))
    else:
        sys.exit(pytest.main(["-sv", __file__]))
Issue Severity
High: It blocks me from completing my task.
rynewang added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Jul 3, 2024