Object GC for block splitting inside the dataset splitting #26196
Conversation
python/ray/data/dataset.py
@@ -3400,6 +3400,11 @@ def _split(
            left_metadata.append(ray.get(m0))
            right_blocks.append(b1)
            right_metadata.append(ray.get(m1))
            # If return_right_half is requested, the input block b will be copied
            # into b0 and b1. In such case, we can safely clear b if this is in
            # lazy mode.
Hmm isn't it copied in both cases?
At the level of this method, the right half may not be copied; it's conditioned on this boolean: https://sourcegraph.com/github.com/ray-project/ray@master/-/blob/python/ray/data/dataset.py?L3689
But the left half is always copied. And the right half is either copied or None. It seems good to clear in either case.
# into b0 and b1. In such case, we can safely clear b if this is in
# lazy mode.
if return_right_half and self._lazy:
    ray._private.internal_api.free(b, local_only=False)
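The pattern being discussed can be sketched in plain Python, with a dict standing in for the plasma object store and a `free` helper mimicking `ray._private.internal_api.free`. All names here are illustrative, not Ray's real internals:

```python
# Toy model of the block-splitting GC pattern: copy a block into two
# halves, then (optionally) free the input block. The "object store"
# is just a dict keyed by object id.

store = {}

def put(obj):
    ref = id(obj)
    store[ref] = obj
    return ref

def free(ref):
    # Stand-in for ray._private.internal_api.free(ref, local_only=False).
    store.pop(ref, None)

def split_block(ref, index, return_right_half, lazy):
    """Copy a block into left/right halves, optionally freeing the input."""
    block = store[ref]
    left = put(block[:index])
    right = put(block[index:]) if return_right_half else None
    # Mirrors the PR: only clear the input block when both halves were
    # copied (return_right_half) and the dataset is lazy.
    if return_right_half and lazy:
        free(ref)
    return left, right

b = put(list(range(10)))
l, r = split_block(b, 4, return_right_half=True, lazy=True)
print(store[l], store[r])  # -> [0, 1, 2, 3] [4, 5, 6, 7, 8, 9]
print(b in store)          # -> False (the input block was freed)
```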
Should we more generally auto-free on any transform in lazy mode? It feels a little ad-hoc right now, like we should improve our block handling abstractions. Though we can defer this to future work.
Yes, we should do it more cleanly. We may need an overhaul of the execution semantics, e.g.: 1) we execute the plan when split() is called, regardless of whether the dataset is lazy; 2) we need clear semantics for eager vs. cached/pinned datasets (the latter via running ds.fully_executed()). fully_executed() will make the dataset behave like an eager one, but running it isn't going to turn a lazy dataset into an eager one.
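One way to read the proposed semantics is sketched below, without Ray: a lazy dataset re-executes its plan from the beginning on every consumption, while `fully_executed()` pins the computed blocks so later reads reuse them. The class and its internals are made up for illustration; only the `fully_executed` name mirrors the Dataset API:

```python
# Toy model of "lazy re-executes from the beginning" vs. "pinned after
# fully_executed()". Not Ray's real implementation.

class ToyLazyDataset:
    def __init__(self, source_fn):
        self._source_fn = source_fn   # a re-runnable "plan"
        self._pinned_blocks = None    # set by fully_executed()

    def fully_executed(self):
        # Materialize once and pin; the dataset now behaves eagerly.
        self._pinned_blocks = self._source_fn()
        return self

    def take(self):
        if self._pinned_blocks is not None:
            return self._pinned_blocks  # pinned: safe to reuse
        return self._source_fn()        # lazy: recompute from scratch

runs = []
ds = ToyLazyDataset(lambda: runs.append(1) or list(range(5)))
ds.take(); ds.take()   # lazy: the plan runs twice
assert len(runs) == 2
ds.fully_executed()    # runs once more, then pins the blocks
ds.take(); ds.take()   # pinned: no further executions
assert len(runs) == 3
```

Under these semantics, eagerly freeing intermediate blocks of a lazy dataset is safe, because any later consumption recomputes from the source (or from the nearest pinned dataset) rather than reading the freed blocks.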
Since this is force-clearing the base blocks of the dataset, we should be careful that this doesn't break certain fan-out patterns. E.g.:

ds = ray.data.from_items(list(range(1000))).experimental_lazy()
ds.limit(10).show()
ds.show()

ds2 = ray.data.from_items(list(range(1000))).experimental_lazy()
dses = ds.split_at_indices([250, 750])
for ds in dses:
    ds.show()
ds2.show()

The first should work since we're only clearing when return_right_half=True (although, agreed with @ericl, that shouldn't matter), but I think the second might fail.
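The general hazard can be modeled without Ray at all: a dict stands in for the object store, and a hypothetical split routine frees the base block after copying it into parts. Any later consumer of the same base blocks then hits a dangling reference. Names and structure here are illustrative only:

```python
# Toy model: eagerly freeing base blocks during a split breaks any
# later consumer of the same blocks.

store = {"block0": list(range(1000))}  # stand-in for the plasma store

def split_at_indices(ref, indices):
    data = store[ref]
    bounds = [0] + indices + [len(data)]
    parts = [data[lo:hi] for lo, hi in zip(bounds, bounds[1:])]
    del store[ref]  # eager GC of the base block after copying
    return parts

parts = split_at_indices("block0", [250, 750])
assert [len(p) for p in parts] == [250, 500, 250]

# A second consumer of the same base blocks (e.g. another show() on the
# un-split dataset) now reads a freed object:
try:
    store["block0"]
    dangling = False
except KeyError:
    dangling = True
assert dangling
```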
I looked more into this regarding safety. It can actually happen quite easily (no need to fan out or even to use from_items()). For example, this will crash:
Eager object GC is generally trickier for a Dataset than for a DatasetPipeline, since the latter is protected by its API semantics: a pipeline can be consumed at most once. We might still be able to do eager GC for a lazy dataset if we move to semantics like "each time a lazy dataset is executed, it is executed from the beginning" (where "beginning" means, for example, the source input blocks, or some intermediate dataset that is known to be pinned and cannot be cleared before being unpinned, such as the output of an eager dataset). This needs a revamp (filed #26264) first, at least in part. One hack we could try is to flag the Dataset (with an extra parameter) as to whether it's used in DatasetPipeline processing, and if so, clear the input blocks after the split.
@ericl @clarkzinzow This should be ready to review.
LGTM!
…26196)" (#26495) This reverts commit 45ba0e3. Failures in the Train GPU job started popping up involving lost references around when this PR was merged; there was an ongoing failure that was reverted that overlaps this PR, but this PR is the most likely culprit for this particular lost-reference issue, so we should try reverting it.
- Flakey test tracker: https://flakey-tests.ray.io/
- Example failure: https://buildkite.com/ray-project/ray-builders-branch/builds/8585#0181f423-0fe2-42b5-9dd8-47d2c7f9efa7
…tting (ray-project#26196)" (ray-project#26495)" This reverts commit 12ea100.
…uns the plan (ray-project#26650) Having the indicator about who's running the stage and who created a blocklist will enable the eager memory releasing. This is an alternative with better abstraction to ray-project#26196. Note: this doesn't work for Dataset.split() yet, will do in a followup PR. Signed-off-by: Rohan138 <[email protected]>
Why are these changes needed?
The pipeline will spill objects when splitting the dataset into multiple equal parts.
Related issue number
#25249
Checks
- I've run scripts/format.sh to lint the changes in this PR.