[core] Support generators to allow tasks to return a dynamic number of objects #28291

stephanie-wang · 2022-09-06T01:11:42Z

Why are these changes needed?

This adds support for tasks that need to return a dynamic number of objects. When a remote generator function is invoked and num_returns for the task is 1, the worker will dynamically allocate ray.put IDs for these objects and store an ObjectRefGenerator as its return value. This allows the worker to choose how many objects to return and to keep heap memory low, since it does not need to keep all objects in memory simultaneously.
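
To make the worker-side behavior concrete, here is a minimal sketch in plain Python. It is a toy model, not the code in this PR: `ToyObjectStore`, `run_generator_task`, and `split` are hypothetical names, and the store lives in the same process, so it only illustrates the control flow (store each yielded value as it is produced, return the IDs) and not the actual memory savings.

```python
import itertools
from typing import Callable, Dict, Iterator, List


class ToyObjectStore:
    """Hypothetical in-process stand-in for an object store."""

    def __init__(self) -> None:
        self._objects: Dict[str, object] = {}
        self._ids = itertools.count()

    def put(self, value: object) -> str:
        object_id = f"obj-{next(self._ids)}"
        self._objects[object_id] = value
        return object_id

    def get(self, object_id: str) -> object:
        return self._objects[object_id]


def run_generator_task(
    store: ToyObjectStore, fn: Callable[..., Iterator[object]], *args: object
) -> List[str]:
    """Store each yielded value as soon as it is produced.

    The loop holds one yielded value at a time; the returned list of IDs
    plays the role of the ObjectRefGenerator in this toy model.
    """
    refs: List[str] = []
    for value in fn(*args):
        refs.append(store.put(value))
    return refs


def split(n: int) -> Iterator[List[int]]:
    # The task, not the caller, decides how many objects to return.
    for i in range(n):
        yield [i] * 3


store = ToyObjectStore()
refs = run_generator_task(store, split, 4)
values = [store.get(ref) for ref in refs]
```

The caller never states how many objects to expect; it learns the count from the returned IDs.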

Unlike normal ray.put(), we assign the task caller as the owner of the object. This is to improve fault tolerance, as the owner can recover dynamically generated objects through the normal lineage reconstruction codepath.

The main complication has to do with notifying the task caller that it owns these objects. We do this in two places, which is necessary because the protocols are asynchronous, so either message can arrive first.

1. When the task reply is received.
2. When the primary raylet subscribes to the eviction notice from the owner.

To register the dynamic return, the owner adds the ObjectRef to the ref counter and marks that it is contained in the generator object.
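
The reason registering in both places is safe can be sketched as an idempotent registration step. This is a toy model under stated assumptions: `ToyOwner` and its method names are hypothetical and do not correspond to the C++ reference counter API; the point is only that either arrival order leaves the owner in the same state.

```python
from typing import Dict, Iterable, Set, Tuple


class ToyOwner:
    """Hypothetical model of the owner's bookkeeping for dynamic returns."""

    def __init__(self) -> None:
        self.known_refs: Set[str] = set()
        # generator ID -> IDs of the objects contained in that generator
        self.contained_in: Dict[str, Set[str]] = {}

    def register_dynamic_return(self, object_id: str, generator_id: str) -> None:
        # Idempotent: registering the same pair a second time changes nothing.
        self.known_refs.add(object_id)
        self.contained_in.setdefault(generator_id, set()).add(object_id)

    def on_task_reply(self, generator_id: str, return_ids: Iterable[str]) -> None:
        for object_id in return_ids:
            self.register_dynamic_return(object_id, generator_id)

    def on_eviction_subscription(self, object_id: str, generator_id: str) -> None:
        # May arrive before the task reply, so the object may be unknown here.
        self.register_dynamic_return(object_id, generator_id)


def final_state(reply_first: bool) -> Tuple[Set[str], Dict[str, Set[str]]]:
    owner = ToyOwner()
    if reply_first:
        owner.on_task_reply("gen-1", ["obj-a", "obj-b"])
        owner.on_eviction_subscription("obj-a", "gen-1")
    else:
        owner.on_eviction_subscription("obj-a", "gen-1")
        owner.on_task_reply("gen-1", ["obj-a", "obj-b"])
    return owner.known_refs, owner.contained_in


state_reply_first = final_state(reply_first=True)
state_subscription_first = final_state(reply_first=False)
```

Because both handlers funnel into the same idempotent step, neither needs to know whether the other has already run.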

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

TODO:

- docs
- nondeterministic recovery test - ref counting bug?

Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang · 2022-09-06T01:12:36Z

Still have some TODOs, but the main changes are ready for review.

Signed-off-by: Stephanie Wang <[email protected]>

clarkzinzow

Sweet, general approach LGTM but I'll defer to others for a more thorough review.

In addition to the object eviction subscription, is there a potential race between the pinned object location update and the push task reply for the dynamic return objects? IIRC both of those RPCs are async with no synchronization barrier in-between. Maybe this is a non-issue because we always add the pinned object location when processing the push task reply?

ray/src/ray/core_worker/task_manager.cc

Lines 249 to 252 in 775ff3e

 // NOTE(swang): We need to add the location of the object before marking
 // it as local in the in-memory store so that the data locality policy
 // will choose the right raylet for any queued dependent tasks.
 reference_counter_->UpdateObjectPinnedAtRaylet(object_id, worker_raylet_id);

src/ray/protobuf/node_manager.proto

src/ray/protobuf/pubsub.proto

clarkzinzow · 2022-09-06T16:32:07Z

src/ray/core_worker/core_worker.cc

+ // eviction events before we know about the object. This can happen when we
+ // receive the subscription request before the reply from the task that
+ // created the object. Add the dynamically created object to our ref
+ // counter so that we know that it exists.


Nice, I was wondering if this race was going to be covered.

clarkzinzow · 2022-09-06T16:52:46Z

src/ray/core_worker/core_worker.cc

+ for (const auto &return_id : return_ids) {
+ RAY_LOG(DEBUG) << "Task " << task_spec.TaskId() << " will return object "
+ << return_id;
+ }


Nit: This return ID iteration + debug logging could be moved to the loop directly above this one.

clarkzinzow · 2022-09-06T16:55:49Z

src/ray/core_worker/core_worker.h

@@ -343,6 +343,7 @@ class CoreWorker : public rpc::CoreWorkerServiceHandler {
 /// \return Status.
 Status SealExisting(const ObjectID &object_id,
 bool pin_object,
+ const ObjectID &generator_id = ObjectID::Nil(),


generator_id should be added to the docstring for this method and SealReturnObject, PinExistingReturnObject.

stephanie-wang · 2022-09-06T17:29:34Z

> Sweet, general approach LGTM but I'll defer to others for a more thorough review.
>
> In addition to the object eviction subscription, is there a potential race between the pinned object location update and the push task reply for the dynamic return objects? IIRC both of those RPCs are async with no synchronization barrier in-between. Maybe this is a non-issue because we always add the pinned object location when processing the push task reply?
>
> ray/src/ray/core_worker/task_manager.cc
> Lines 249 to 252 in 775ff3e
>
> // NOTE(swang): We need to add the location of the object before marking
> // it as local in the in-memory store so that the data locality policy
> // will choose the right raylet for any queued dependent tasks.
> reference_counter_->UpdateObjectPinnedAtRaylet(object_id, worker_raylet_id);

Hmm I think this part is okay because of what you said: it all happens locally at the task caller when processing the push task reply.

I think there may be a race condition for object directory location updates, though, let me look into this.

Co-authored-by: Clark Zinzow <[email protected]>

Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang · 2022-09-07T00:15:55Z


Okay, I believe I've resolved this issue by also attaching the generator ID to spill location updates. I don't think it's necessary for in-memory locations (see added comments).

ericl

Just tried this, very cool! A couple API comments:

- When you specify num_returns, it seems that you can also yield exactly that many values, but the result is not a generator return. This seems confusing; should we disallow mixing num_returns and generators?
- Should we define __len__ on the generator return object? It seems reasonable to include it, and even if we later decide to support streaming generators, we could just raise an error when the length is requested in that case.
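
The `__len__` suggestion could look roughly like the sketch below. `ToyRefGenerator` is a hypothetical class, not Ray's ObjectRefGenerator, and whether the real object defines `__len__` is not settled in this thread. Passing `None` models a possible future streaming generator whose length is unknown.

```python
from typing import Iterator, List, Optional


class ToyRefGenerator:
    """Hypothetical generator-return object; not Ray's ObjectRefGenerator."""

    def __init__(self, refs: Optional[List[str]] = None) -> None:
        # None models a streaming generator whose length is not yet known.
        self._refs = None if refs is None else list(refs)

    def __iter__(self) -> Iterator[str]:
        return iter(self._refs or [])

    def __len__(self) -> int:
        if self._refs is None:
            raise TypeError("length is not known for a streaming generator")
        return len(self._refs)


finished = ToyRefGenerator(["obj-0", "obj-1", "obj-2"])
streaming = ToyRefGenerator()

try:
    len(streaming)
    streaming_len_error = None
except TypeError as exc:
    streaming_len_error = str(exc)
```

Raising `TypeError` follows the convention Python itself uses for objects without a length, so callers that probe with `len()` get a familiar failure.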

python/ray/_raylet.pyx

python/ray/tests/test_generators.py

python/ray/_raylet.pyx

ericl

We should also add docs to this I think.

Signed-off-by: Stephanie Wang <[email protected]>

jjyao · 2022-09-19T04:25:47Z

Ping me when it's ready for more reviews :)

stephanie-wang · 2022-09-19T16:35:35Z

> Ping me when it's ready for more reviews :)

I think all the comments were addressed already, trying to fix CI now.

pcmoritz

Thanks, LGTM! Before merging, let's please mark the num_returns="dynamic" API as experimental.

Before we declare it stable, in particular I'm curious why the num_returns="dynamic" option doesn't return a generator of ObjectRefs (from the API perspective that seems more natural for me). Are there implementation limitations for this?

jjyao

My last few comments :)

python/ray/_private/worker.py

python/ray/_raylet.pyx

jjyao · 2022-09-21T15:41:18Z

python/ray/_raylet.pyx

+ # number of objects as before.
+ num_returns = returns[0].size()
+ else:
+ # This is the first execution of the task, so we don't know how


This doesn't necessarily mean it's the first execution of the task? It can also mean the generator is empty?

Are we able to catch this case: the first execution returns an empty generator but re-execution returns a non-empty generator?

stephanie-wang: Hmm, good catch, let me check. We'll probably have to resolve this case as a follow-up.

stephanie-wang: Actually, it works, since we don't reconstruct empty ObjectRefGenerators. Added a test.
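
The ambiguity discussed in this thread can be illustrated with a small sketch. It is not the logic in _raylet.pyx: `collect_returns` and `previous_count` are hypothetical names, and the assumption is that a count of 0 is the only signal available, so it cannot distinguish "first execution" from "the earlier execution yielded nothing". The error message is the one from the diff above.

```python
from typing import Iterable, List


def collect_returns(outputs: Iterable[object], previous_count: int = 0) -> List[object]:
    """Collect a task's yielded values, checking them against an earlier run.

    previous_count == 0 is ambiguous: it covers both "first execution" and
    "the earlier execution yielded nothing", so no limit is enforced then.
    """
    collected: List[object] = []
    for i, value in enumerate(outputs):
        if previous_count and i >= previous_count:
            raise ValueError(
                "Task returned more than num_returns={} objects.".format(previous_count)
            )
        collected.append(value)
    return collected


first_run = collect_returns(iter(["a", "b"]))
rerun_same = collect_returns(iter(["a", "b"]), previous_count=2)

try:
    collect_returns(iter(["a", "b", "c"]), previous_count=2)
    rerun_error = None
except ValueError as exc:
    rerun_error = str(exc)
```

In this sketch a nondeterministic task is only caught when the earlier run produced at least one object, which is why the empty-generator case needed its own answer in the thread.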

jjyao · 2022-09-21T15:45:22Z

python/ray/_raylet.pyx

 raise ValueError(
 "Task returned more than num_returns={} objects.".format(
- n_returns))
+ num_returns))
+ while i >= returns[0].size():


What about this? Is it possible that this loop will be executed more than once?

jjyao · 2022-09-21T15:56:01Z

src/ray/core_worker/core_worker.h

 /// \param[in] owner_address Address of the owner of the object who will be contacted by
 /// the raylet if the object is pinned. If not provided, defaults to this worker.
 /// \return Status.
 Status SealExisting(const ObjectID &object_id,
 bool pin_object,
+ const ObjectID &generator_id = ObjectID::Nil(),


How do we decide when to use which style? What's the guideline we should follow in the future?

stephanie-wang · 2022-09-21T16:10:16Z

> Thanks, LGTM! Before merging, let's please mark the num_returns="dynamic" API as experimental.
>
> Before we declare it stable, in particular I'm curious why the num_returns="dynamic" option doesn't return a generator of ObjectRefs (from the API perspective that seems more natural for me). Are there implementation limitations for this?

The main reason right now was to avoid complicating ray.wait. Here are the pros/cons for returning a generator directly:

pros:

- More consistent with the usual semantics of num_returns=x (directly return a generator of refs instead of an ObjectRef containing a generator of refs). This would also make it simpler to swap between code that uses a static vs. dynamic num_returns, since the caller code doesn't need to insert an extra ray.get.
- More explicit that the function is returning a generator instead of a normal value.

cons:

- Accessing the generator for the first time will implicitly block until we know how many refs to return. This might be a gotcha for users.
- Need to extend ray.wait to support ObjectRefGenerators in addition to ObjectRefs, whereas before you could just wait on the returned ObjectRef. I don't think other APIs are affected.
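
The caller-side difference between the two designs can be sketched with a toy store. Everything here is hypothetical (`put`, `get`, `call_static`, `call_dynamic` are stand-ins, not Ray calls); it only shows where the extra "get" appears when the task returns one outer ref that contains the inner refs.

```python
from typing import Callable, Dict, Iterator, List

store: Dict[str, object] = {}  # hypothetical stand-in for the object store


def put(value: object) -> str:
    object_id = f"obj-{len(store)}"
    store[object_id] = value
    return object_id


def get(object_id: str) -> object:
    return store[object_id]


def numbers() -> Iterator[int]:
    yield from (10, 20, 30)


def call_static(fn: Callable[[], Iterator[int]]) -> List[str]:
    # Static style: the caller receives the inner refs directly.
    return [put(value) for value in fn()]


def call_dynamic(fn: Callable[[], Iterator[int]]) -> str:
    # Dynamic style in this PR: one outer ref whose value holds the inner refs.
    return put([put(value) for value in fn()])


static_values = [get(ref) for ref in call_static(numbers)]

outer_ref = call_dynamic(numbers)
inner_refs = get(outer_ref)  # the extra "get" the caller has to insert
dynamic_values = [get(ref) for ref in inner_refs]
```

The single outer ref is what lets existing wait-style APIs keep working unchanged, at the cost of that one extra step on the caller side.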

Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang · 2022-09-21T21:59:29Z

All new tests passing.

stephanie-wang added 4 commits September 3, 2022 18:49

Worker side

5636eb9

Signed-off-by: Stephanie Wang <[email protected]>

Owner side, works except for when spilling?

3457a46

Signed-off-by: Stephanie Wang <[email protected]>

now it works for spilling/in-plasma objects

031640b

Signed-off-by: Stephanie Wang <[email protected]>

recovery test. TODO:

e59dd65

- nondeterministic recovery test - ref counting bug? Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang requested review from wuisawesome, ericl, AmeerHajAli, robertnishihara, pcmoritz, raulchen, fishbone and scv119 as code owners September 6, 2022 01:11

stephanie-wang added the @author-action-required label Sep 6, 2022

stephanie-wang assigned ericl, scv119 and jjyao Sep 6, 2022

Sort of fix nondeterminism

775ff3e

Signed-off-by: Stephanie Wang <[email protected]>

clarkzinzow reviewed Sep 6, 2022

stephanie-wang and others added 5 commits September 6, 2022 15:39

Update src/ray/protobuf/node_manager.proto

b316b9f

Co-authored-by: Clark Zinzow <[email protected]>

Update src/ray/protobuf/pubsub.proto

2d0f5d4

Co-authored-by: Clark Zinzow <[email protected]>

C++

faf64cb

Signed-off-by: Stephanie Wang <[email protected]>

doc

5d92940

Signed-off-by: Stephanie Wang <[email protected]>

fixes

7c69b32

Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang removed the @author-action-required label Sep 7, 2022

ericl approved these changes Sep 8, 2022

ericl added the @author-action-required label Sep 8, 2022

ericl reviewed Sep 8, 2022

stephanie-wang added 7 commits September 13, 2022 22:44

x

805a088

Signed-off-by: Stephanie Wang <[email protected]>

cpp

3360d63

Signed-off-by: Stephanie Wang <[email protected]>

fix

2ed252a

Signed-off-by: Stephanie Wang <[email protected]>

x

3a811dc

Signed-off-by: Stephanie Wang <[email protected]>

x

285d98d

Signed-off-by: Stephanie Wang <[email protected]>

Merge remote-tracking branch 'upstream/master' into generators-forreal

7a86f19

options

b045d34

Signed-off-by: Stephanie Wang <[email protected]>

x

7b3cf27

pcmoritz approved these changes Sep 21, 2022

jjyao reviewed Sep 21, 2022

experimental

92a8491

jjyao mentioned this pull request Sep 21, 2022

Let driver own pcollections ray-project/ray_beam_runner#41

Merged

stephanie-wang added 2 commits September 21, 2022 12:26

experimental

492b95e

Signed-off-by: Stephanie Wang <[email protected]>

x

07ca73d

Signed-off-by: Stephanie Wang <[email protected]>

jjyao approved these changes Sep 21, 2022

fix

c196c67

Signed-off-by: Stephanie Wang <[email protected]>

stephanie-wang merged commit 45d7cd2 into ray-project:master Sep 21, 2022

stephanie-wang deleted the generators-forreal branch September 21, 2022 21:59

stephanie-wang mentioned this pull request Sep 28, 2022

[core] Dynamic generators that error return partial ObjectRefs followed by exception ObjectRef #28864

Merged

7 tasks
