Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Revert "Object GC for block splitting inside the dataset spli… #26583

Closed
wants to merge 3 commits into from

Conversation

jianoaix
Copy link
Contributor

@jianoaix jianoaix commented Jul 14, 2022

Why are these changes needed?

Fix the original PR #26196

The issue is we have a in-place transformation (randomize block order) in the pipeline, so we need to make sure it's handled safely.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jianoaix jianoaix added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 14, 2022
@ericl
Copy link
Contributor

ericl commented Jul 15, 2022

If I understand correctly, the issue is something like ray.data.range(10).randomize_block_order().window() ends up clearing the original blocks right? How about we just have randomize_block_order stage set created_by_pipeline=False for its output Dataset then? It seems that avoids this whole API change and tracking of in-place index.

In other words, the core issue is that created_by_pipeline is not set correctly for the output of the randomize blocks stage. The blocks were not actually created by the pipeline.

@ericl ericl self-assigned this Jul 15, 2022
@jianoaix
Copy link
Contributor Author

The issue is like ray.data.range(10).window().randomize_block_order() (it should be fine for randomize_block_order() placed before window() since we execute it before making it a pipeline). I aded unit tests to make the issue more clear in this commit: 2f792c0.

Yes, the cause is that blocks are not created by pipeline. I also thought about using stage or blocklist to indicate what we need to know, which is nicer but the stages in DatasetPipeline is not using the Stage that's used in Dataset plan. We probably need some refactor here.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then I think we should fix this properly then, by propagating the flag via blocklist or some other way. It seems much cleaner and is a sign that the current created_by_pipeline is not at the right layer of abstraction.

@jianoaix
Copy link
Contributor Author

jianoaix commented Jul 17, 2022

@ericl @clarkzinzow Here is a draft PR (#26650) which structures the code differently. It doesn't fully work yet, because we have out-of-band transformation (those not captured by the stages/execution plan) like split().

@jianoaix
Copy link
Contributor Author

This PR should be superseded by #26902 and #26650.

@jianoaix jianoaix closed this Jul 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants