Revert "Revert "Object GC for block splitting inside the dataset spli… #26583

jianoaix · 2022-07-14T22:03:24Z

Why are these changes needed?

Fix the original PR #26196

The issue is that we have an in-place transformation (randomize block order) in the pipeline, so we need to make sure it is handled safely.

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

…tting (ray-project#26196)" (ray-project#26495)" This reverts commit 12ea100.

ericl · 2022-07-15T18:04:19Z

If I understand correctly, the issue is that something like ray.data.range(10).randomize_block_order().window() ends up clearing the original blocks, right? How about we just have the randomize_block_order stage set created_by_pipeline=False for its output Dataset? That seems to avoid this whole API change and the tracking of the in-place index.

In other words, the core issue is that created_by_pipeline is not set correctly for the output of the randomize blocks stage. The blocks were not actually created by the pipeline.

jianoaix · 2022-07-15T20:30:59Z

The issue is more like ray.data.range(10).window().randomize_block_order(). (Placing randomize_block_order() before window() should be fine, since we execute it before making the dataset a pipeline.) I added unit tests to make the issue clearer in this commit: 2f792c0.

Yes, the cause is that the blocks are not created by the pipeline. I also thought about using the stage or blocklist to indicate what we need to know, which is nicer, but the stages in DatasetPipeline do not use the Stage that's used in the Dataset plan. We probably need some refactoring here.
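To make the failure mode concrete, here is a toy model in plain Python. It is only a sketch and not Ray's implementation: ToyBlockList, free_blocks, run_stage_with_eager_gc and both stage functions are hypothetical names, and "freeing" a block is simulated by emptying a shared list. It assumes the executor frees a stage's input blocks as soon as the stage has produced its output.

```python
import random
from typing import Callable, List


class ToyBlockList:
    """Hypothetical stand-in for a list of block references."""

    def __init__(self, blocks: List[list]):
        self.blocks = blocks


def free_blocks(bl: ToyBlockList) -> None:
    # Stand-in for eagerly releasing block memory: every holder of the
    # same block object sees it emptied.
    for block in bl.blocks:
        block.clear()


def map_stage(bl: ToyBlockList) -> ToyBlockList:
    # Produces brand-new blocks, so the input can be freed afterwards.
    return ToyBlockList([[x * 2 for x in block] for block in bl.blocks])


def randomize_block_order_stage(bl: ToyBlockList) -> ToyBlockList:
    # "In-place": the output holds the SAME block objects in a new order.
    shuffled = list(bl.blocks)
    random.shuffle(shuffled)
    return ToyBlockList(shuffled)


def run_stage_with_eager_gc(
    stage: Callable[[ToyBlockList], ToyBlockList], bl: ToyBlockList
) -> ToyBlockList:
    out = stage(bl)
    free_blocks(bl)  # unsafe when `out` shares blocks with `bl`
    return out


# A stage that creates new blocks survives the eager free.
mapped = run_stage_with_eager_gc(map_stage, ToyBlockList([[0, 1], [2, 3]]))

# An in-place stage loses its data, because input and output share blocks.
shuffled = run_stage_with_eager_gc(
    randomize_block_order_stage, ToyBlockList([[0, 1], [2, 3]])
)
```

In this model, mapped keeps its data while every block in shuffled ends up empty, which mirrors the hazard described above for an in-place stage running inside a pipeline.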

ericl

I see. Then I think we should fix this properly, by propagating the flag via the blocklist or some other way. That seems much cleaner, and it is a sign that the current created_by_pipeline is not at the right layer of abstraction.
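One way to picture propagating the flag via the blocklist is the sketch below. It is an illustration under assumptions, not the design that was eventually merged: FlaggedBlockList and its owned_by_consumer field are hypothetical names, and freeing a block is again simulated by emptying a list. The idea is that a stage which creates new blocks marks its output as freeable, while an in-place stage simply inherits the flag from its input.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class FlaggedBlockList:
    blocks: List[list]
    # Hypothetical flag: True only if the executor produced these blocks
    # and may free them once they have been consumed.
    owned_by_consumer: bool


def map_stage(bl: FlaggedBlockList) -> FlaggedBlockList:
    out = [[x + 1 for x in block] for block in bl.blocks]
    if bl.owned_by_consumer:
        # Free the input only when the flag says nobody else holds it.
        for block in bl.blocks:
            block.clear()
    # Newly created blocks belong to whoever consumes them next.
    return FlaggedBlockList(out, owned_by_consumer=True)


def randomize_block_order_stage(bl: FlaggedBlockList) -> FlaggedBlockList:
    shuffled = list(bl.blocks)
    random.shuffle(shuffled)
    # Same blocks, new order: ownership is inherited, not reset.
    return FlaggedBlockList(shuffled, owned_by_consumer=bl.owned_by_consumer)


# Blocks held by the user are never freed, even after an in-place stage.
user_blocks = FlaggedBlockList([[0, 1], [2, 3]], owned_by_consumer=False)
result = map_stage(randomize_block_order_stage(user_blocks))

# Blocks created by an earlier stage stay freeable after the shuffle.
intermediate = map_stage(user_blocks)
final = map_stage(randomize_block_order_stage(intermediate))
```

Because the flag travels with the blocks, no stage needs to know whether it runs inside a pipeline, which is the layering concern raised in the comment above.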

jianoaix · 2022-07-17T22:47:07Z

@ericl @clarkzinzow Here is a draft PR (#26650) which structures the code differently. It doesn't fully work yet, because we have out-of-band transformations (those not captured by the stages/execution plan) like split().

jianoaix · 2022-07-22T21:46:56Z

This PR should be superseded by #26902 and #26650.

Ubuntu added 2 commits July 13, 2022 16:58

Revert "Revert "Object GC for block splitting inside the dataset spli…

d4e0bdb

…tting (ray-project#26196)" (ray-project#26495)" This reverts commit 12ea100.

Make gc safe

7060013

jianoaix requested review from ericl, scv119, clarkzinzow and jjyao as code owners July 14, 2022 22:03

jianoaix added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 14, 2022

ericl self-assigned this Jul 15, 2022

test it

2f792c0

ericl requested changes Jul 15, 2022

jianoaix mentioned this pull request Jul 16, 2022

[Datasets] Refactor split_at_indices() to minimize number of split tasks and data movement. #26363

Merged

8 tasks

jianoaix closed this Jul 22, 2022
