
[Data] Add support for shuffling input files #40154

Merged: 5 commits into ray-project:master on Oct 13, 2023

Conversation

@c21 (Contributor) commented Oct 5, 2023

Why are these changes needed?

This PR adds support for shuffling the ordering of input files for all file-based data sources. The behavior is controlled through the shuffle argument in the read APIs for file-based data sources:

# Enable input files shuffling with default seed
ds = ray.data.read_parquet(..., shuffle="files")
ds = ray.data.read_images(..., shuffle="files")

Several alternative interfaces were considered but not chosen:

  1. Add it to DataContext. The drawback is that a DataContext config would then control a subtle semantic difference in operator behavior, which could introduce bugs later, and it would be inconsistent with the rest of the APIs.

  2. An optimizer rule to push randomize_block_order down to the data source: read_xxx().randomize_block_order(). This has the benefit of not introducing any new interface, but it has drawbacks: (1) the randomize_block_order() API exposes the block concept, which past user feedback shows is hard to understand and use; (2) the pushdown can only happen when randomize_block_order() is applied immediately after read_xxx(), and it is not safe to push down when other operations sit in between, e.g. read_xxx().map_batches().randomize_block_order(). That behavior would be hard for users to understand and would cause issues as user code gets more complicated.

  3. Introduce a new argument to random_shuffle, plus an optimizer rule to push it down to the data source: read_xxx().random_shuffle(file=True). This has a drawback similar to 2.(2) above, and it is hard to pick a good name for the new argument. file is not a good name, because random_shuffle() should not care whether the data comes from files (that is a data source concept); and random_shuffle(block=True) would expose the block concept and be even more confusing given we already have randomize_block_order().

Note:
The seed always uses the default value and cannot be changed by users. After 2.8, we shall iterate on how to expose a seed option to users. One option is a Dataset.manual/set_seed() API to control the global seed of the random number generator, but it's too early to introduce that without user feedback.
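For illustration, deterministic file shuffling with a fixed default seed could look like the following sketch (shuffle_files and DEFAULT_SHUFFLE_SEED are hypothetical names for this example, not Ray's internals):

```python
import random

# Hypothetical fixed default seed; the PR does not expose a user-facing seed.
DEFAULT_SHUFFLE_SEED = 0

def shuffle_files(file_paths, shuffle=None):
    """Return file paths in shuffled order when shuffle == "files"."""
    if shuffle != "files":
        return list(file_paths)
    shuffled = list(file_paths)
    # A dedicated Random instance keeps the result deterministic and
    # isolated from the global random state.
    random.Random(DEFAULT_SHUFFLE_SEED).shuffle(shuffled)
    return shuffled

paths = ["a.parquet", "b.parquet", "c.parquet", "d.parquet"]
# Same inputs always produce the same order under the default seed.
assert shuffle_files(paths, "files") == shuffle_files(paths, "files")
# The shuffle is a permutation: no files are dropped or duplicated.
assert sorted(shuffle_files(paths, "files")) == paths
```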

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -284,6 +284,7 @@ def _get_block_metadata(
if (
prefetched_metadata is not None
and len(prefetched_metadata) == num_fragments
and all(m is not None for m in prefetched_metadata)
@c21 (Contributor, Author):
This is needed for the Parquet datasource, as previously it depended on the Parquet datasource returning an empty list if metadata is unknown.

@stephanie-wang (Contributor):

LGTM! A couple questions:

  • Should we use shuffle_files=True instead of shuffle="files"?
  • Should we update the train.DataConfig to set this automatically? (in a separate PR)

def test_random_shuffle(self, ray_start_regular_shared):
# NOTE: set preserve_order to True to allow consistent output behavior.
context = ray.data.DataContext.get_current()
preserve_order = context.execution_options.preserve_order
Contributor:

nit, define a context manager in conftest.py for enabling preserve_order. That would be clearer and reusable.

Contributor:

We also have the restore_data_context fixture.

Member:

+1 let's use a pytest fixture here

@c21 (Contributor, Author):

updated to use restore_data_context, thanks folks!

file_paths == sorted(output_paths)
for output_paths in output_paths_list
]
)
Contributor:

also test read_parquet, as it has a different implementation.

@c21 (Contributor, Author):

yes, added
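The assertion in the test snippet above amounts to a permutation property: every shuffled read must return exactly the input files, and at least one trial should deviate from sorted order. A framework-free sketch of that property (helper names are mine, not from the PR):

```python
def check_shuffled_reads(file_paths, output_paths_list):
    """file_paths is assumed sorted; each trial must be a permutation."""
    # Every trial reads exactly the input files, in some order.
    assert all(
        file_paths == sorted(output_paths)
        for output_paths in output_paths_list
    )
    # At least one trial deviates from sorted order; with enough files
    # and trials, all-sorted outputs are vanishingly unlikely.
    assert any(
        output_paths != file_paths for output_paths in output_paths_list
    )

files = ["part-0", "part-1", "part-2", "part-3"]
trials = [["part-2", "part-0", "part-3", "part-1"],
          ["part-1", "part-3", "part-0", "part-2"]]
check_shuffled_reads(files, trials)
```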

python/ray/data/read_api.py (review thread resolved)
python/ray/data/datasource/file_based_datasource.py (outdated; review thread resolved)

python/ray/data/tests/test_image.py (two outdated review threads, resolved)
python/ray/data/read_api.py (review thread resolved)
python/ray/data/datasource/file_based_datasource.py (outdated; review thread resolved)
The file paths and their sizes after shuffling.
"""
raise NotImplementedError
class FileMetadataShuffler:
Member:
IMO this abstraction adds an unnecessary layer of indirection. It essentially just wraps a single function. I think it'd be simpler if we did something like this in FileBasedDatasource:

if shuffle == "files":
    # np.random.shuffle shuffles in place and returns None.
    np.random.shuffle(metadata)

If we introduce different shuffling methods in the future, we can always revisit this and introduce a new abstraction. But at this point, I think it's premature.

@c21 (Contributor, Author):

I was also thinking about that. I wanted to avoid some code duplication, but it looks like there's not much duplication now. Removed this class for now.

Member:

Looks like the FileMetadataShuffler class is still here (?)

python/ray/data/datasource/parquet_datasource.py (outdated; review thread resolved)
python/ray/data/read_api.py (review thread resolved)
@c21 (Contributor, Author) commented Oct 12, 2023

Should we use shuffle_files=True instead of shuffle="files"?

@stephanie-wang, per our earlier offline discussion, shuffle_files=True has the limitation that it cannot be extended later: if we need to support a shuffle seed or a different shuffle granularity, we would have to add more arguments separately. With a single shuffle argument, we can overload the data type later, e.g. to support a ShuffleOption class via shuffle=ShuffleOption(seed=..., ...)

Should we update the train.DataConfig to set this automatically? (in a separate PR)

That's a good question; probably not in 2.8. It needs more discussion with Ray Train. I'd prefer to introduce it in Data first and gather user feedback.
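The extensibility argument above can be sketched as follows (ShuffleOption and resolve_shuffle are hypothetical illustrations of the idea, not APIs added by this PR):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShuffleOption:
    # Hypothetical knobs mentioned in the discussion.
    seed: Optional[int] = None
    granularity: str = "files"

def resolve_shuffle(shuffle):
    """Normalize the shuffle argument a read API might accept."""
    if shuffle is None:
        return None
    if shuffle == "files":
        # Today's string form maps to the default options.
        return ShuffleOption()
    if isinstance(shuffle, ShuffleOption):
        return shuffle
    raise ValueError(f"Unsupported shuffle argument: {shuffle!r}")

# The same parameter accepts both forms, so no new keyword is needed later.
assert resolve_shuffle("files") == ShuffleOption()
assert resolve_shuffle(ShuffleOption(seed=42)).seed == 42
assert resolve_shuffle(None) is None
```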

@c21 (Contributor, Author) commented Oct 12, 2023

All comments should be addressed, PTAL thanks! cc @raulchen, @stephanie-wang and @bveeramani.

@bveeramani (Member) left a comment:

I think we still need to remove file_metadata_shuffler.py (?), but other than that LGTM


@c21 (Contributor, Author) commented Oct 12, 2023

I think we still need to remove file_metadata_shuffler.py (?), but other than that LGTM

Let me do the code removal in a separate PR. Several places need to be cleaned up, such as file_metadata_shuffler.py, https://github.com/ray-project/ray/blob/master/python/ray/data/_default_config.py and https://github.com/ray-project/ray/blob/master/python/ray/data/context.py#L174 .

@bveeramani (Member):

I think we still need to remove file_metadata_shuffler.py (?), but other than that LGTM

Let me do the code removal in a separate PR. Several places need to be cleaned up, such as file_metadata_shuffler.py, https://github.com/ray-project/ray/blob/master/python/ray/data/_default_config.py and https://github.com/ray-project/ray/blob/master/python/ray/data/context.py#L174 .

Ah, gotcha. Sounds good

@stephanie-wang (Contributor):

Should we use shuffle_files=True instead of shuffle="files"?

@stephanie-wang, per our earlier offline discussion, shuffle_files=True has the limitation that it cannot be extended later: if we need to support a shuffle seed or a different shuffle granularity, we would have to add more arguments separately. With a single shuffle argument, we can overload the data type later, e.g. to support a ShuffleOption class via shuffle=ShuffleOption(seed=..., ...)

Should we update the train.DataConfig to set this automatically? (in a separate PR)

That's a good question; probably not in 2.8. It needs more discussion with Ray Train. I'd prefer to introduce it in Data first and gather user feedback.

Sounds good, thanks for the context.

Signed-off-by: Cheng Su <[email protected]>
@c21 c21 merged commit ba6ae3e into ray-project:master Oct 13, 2023
27 of 40 checks passed
@c21 c21 deleted the shuffle branch October 13, 2023 18:08
@c21 c21 mentioned this pull request Oct 13, 2023
@kszlim commented Oct 14, 2023

Curious how this would work with respect to checkpointing and determinism. If you want reproducibility, and to resume without re-iterating over data you've already trained on, how would you ensure that?

c21 added a commit that referenced this pull request Oct 16, 2023
As a follow-up to #40154 (comment), remove the `FileMetadataShuffler` and the config setting in `DataContext`. They are no longer used.

Signed-off-by: Cheng Su <[email protected]>
@c21 (Contributor, Author) commented Oct 16, 2023

Curious how this would work wrt checkpointing and determinism. If you want to have reproducibility and resume without re-iterating on the same data you've trained on, how would you ensure that?

@kszlim - good question. This PR only enables randomness for training; more design discussion is needed to integrate with checkpointing and achieve resumability.
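One way such determinism could be layered on later (purely a sketch of the idea, not Ray's design): derive the file order from a (seed, epoch) pair, so a resumed run can recompute the same order and skip the files it already consumed before the checkpoint.

```python
import random

def epoch_file_order(file_paths, seed, epoch):
    # The same (seed, epoch) always yields the same order, so a
    # restarted run can reconstruct what the interrupted run saw.
    order = list(file_paths)
    random.Random(f"{seed}-{epoch}").shuffle(order)
    return order

def resume_files(file_paths, seed, epoch, files_consumed):
    # Skip the prefix that was already trained on before the checkpoint.
    return epoch_file_order(file_paths, seed, epoch)[files_consumed:]

files = [f"part-{i}.parquet" for i in range(6)]
full = epoch_file_order(files, seed=123, epoch=0)
# Resuming after 2 files continues exactly where the run left off.
assert resume_files(files, seed=123, epoch=0, files_consumed=2) == full[2:]
```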
