
[Data] Add support for shuffling input files #40154

Merged: 5 commits into ray-project:master on Oct 13, 2023

Conversation

@c21 (Contributor) commented Oct 5, 2023

Why are these changes needed?

This PR adds support for shuffling the ordering of input files for all file-based data sources. The behavior is controlled through the shuffle argument in the read APIs for file-based data sources:

# Enable input files shuffling with default seed
ds = ray.data.read_parquet(..., shuffle="files")
ds = ray.data.read_images(..., shuffle="files")

Several alternative interfaces were considered but not chosen:

  1. Add it to DataContext. The drawback is that a DataContext config would then control a subtle semantic difference in operator behavior, which could introduce bugs later, and it would be inconsistent with the rest of the APIs.

  2. An optimizer rule to push randomize_block_order down to the data source: read_xxx().randomize_block_order(). This has the benefit of not introducing any new interface, but it has drawbacks: (1) the randomize_block_order() API exposes the block concept, which past user feedback shows is hard to understand and use; (2) the pushdown can only happen when randomize_block_order() is applied immediately after read_xxx(), and it is not safe to push down when other operations sit in between, e.g. read_xxx().map_batches().randomize_block_order(). That behavior would be hard for users to understand and would cause issues as user code gets more complicated.

  3. Introduce a new argument to random_shuffle, plus an optimizer rule to push it down to the data source: read_xxx().random_shuffle(file=True). This has a drawback similar to 2.(2) above, and it is hard to pick a good name for the new argument. file is not a good name, because random_shuffle() should not care whether the data comes from files (that is a data source concept); and random_shuffle(block=True) would expose the block concept and be even more confusing given we already have randomize_block_order().

Note:
The seed always uses the default value and cannot be changed by users. After 2.8, we shall iterate on how to expose a seed option to users. One option is a Dataset.manual/set_seed() API to control the global seed of the random number generator, but it's too early to introduce that without user feedback.
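For illustration, deterministic file shuffling with a fixed default seed could look like the following sketch (shuffle_files and DEFAULT_SHUFFLE_SEED are hypothetical names for this example, not Ray's internals):

```python
import random

# Hypothetical fixed default seed; the PR does not expose a user-facing seed.
DEFAULT_SHUFFLE_SEED = 0

def shuffle_files(file_paths, shuffle=None):
    """Return file paths in shuffled order when shuffle == "files"."""
    if shuffle != "files":
        return list(file_paths)
    shuffled = list(file_paths)
    # A dedicated Random instance keeps the result deterministic and
    # isolated from the global random state.
    random.Random(DEFAULT_SHUFFLE_SEED).shuffle(shuffled)
    return shuffled

paths = ["a.parquet", "b.parquet", "c.parquet", "d.parquet"]
# Same inputs always produce the same order under the default seed.
assert shuffle_files(paths, "files") == shuffle_files(paths, "files")
# The shuffle is a permutation: no files are dropped or duplicated.
assert sorted(shuffle_files(paths, "files")) == paths
```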

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -284,6 +284,7 @@ def _get_block_metadata(
if (
prefetched_metadata is not None
and len(prefetched_metadata) == num_fragments
and all(m is not None for m in prefetched_metadata)
@c21 (Contributor, Author):
This is needed for the Parquet datasource, as previously it depended on the Parquet datasource returning an empty list if metadata is unknown.

@stephanie-wang (Contributor):

LGTM! A couple questions:

  • Should we use shuffle_files=True instead of shuffle="files"?
  • Should we update the train.DataConfig to set this automatically? (in a separate PR)

def test_random_shuffle(self, ray_start_regular_shared):
# NOTE: set preserve_order to True to allow consistent output behavior.
context = ray.data.DataContext.get_current()
preserve_order = context.execution_options.preserve_order
Contributor:

nit, define a context manager in conftest.py for enabling preserve_order. That would be clearer and reusable.

Contributor:

We also have the restore_data_context fixture.

Member:

+1 let's use a pytest fixture here

@c21 (Contributor, Author):

updated to use restore_data_context, thanks folks!

file_paths == sorted(output_paths)
for output_paths in output_paths_list
]
)
Contributor:

also test read_parquet, as it has a different implementation.

@c21 (Contributor, Author):

yes, added
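The assertion in the test snippet above amounts to a permutation property: every shuffled read must return exactly the input files, and at least one trial should deviate from sorted order. A framework-free sketch of that property (helper names are mine, not from the PR):

```python
def check_shuffled_reads(file_paths, output_paths_list):
    """file_paths is assumed sorted; each trial must be a permutation."""
    # Every trial reads exactly the input files, in some order.
    assert all(
        file_paths == sorted(output_paths)
        for output_paths in output_paths_list
    )
    # At least one trial deviates from sorted order; with enough files
    # and trials, all-sorted outputs are vanishingly unlikely.
    assert any(
        output_paths != file_paths for output_paths in output_paths_list
    )

files = ["part-0", "part-1", "part-2", "part-3"]
trials = [["part-2", "part-0", "part-3", "part-1"],
          ["part-1", "part-3", "part-0", "part-2"]]
check_shuffled_reads(files, trials)
```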

python/ray/data/read_api.py (review thread resolved)
python/ray/data/datasource/file_based_datasource.py (outdated; review thread resolved)

python/ray/data/tests/test_image.py (two outdated review threads, resolved)
python/ray/data/read_api.py (review thread resolved)
python/ray/data/datasource/file_based_datasource.py (outdated; review thread resolved)
The file paths and their sizes after shuffling.
"""
raise NotImplementedError
class FileMetadataShuffler:
Member:
IMO this abstraction adds an unnecessary layer of indirection. It essentially just wraps a single function. I think it'd be simpler if we did something like this in FileBasedDatasource:

if shuffle == "files":
    # np.random.shuffle shuffles in place and returns None.
    np.random.shuffle(metadata)

If we introduce different shuffling methods in the future, we can always revisit this and introduce a new abstraction. But at this point, I think it's premature.

@c21 (Contributor, Author):

I was also thinking about that. I wanted to avoid some code duplication, but it looks like there's not much duplication now. Removed this class for now.

Member:

Looks like the FileMetadataShuffler class is still here (?)

python/ray/data/datasource/parquet_datasource.py (outdated; review thread resolved)
python/ray/data/read_api.py (review thread resolved)
@c21 (Contributor, Author) commented Oct 12, 2023

Should we use shuffle_files=True instead of shuffle="files"?

@stephanie-wang, per our earlier offline discussion, shuffle_files=True has the limitation that it cannot be extended later: if we need to support a shuffle seed or a different shuffle granularity, we would have to add more arguments separately. With a single shuffle argument, we can overload the data type later, e.g. to support a ShuffleOption class via shuffle=ShuffleOption(seed=..., ...)

Should we update the train.DataConfig to set this automatically? (in a separate PR)

That's a good question; probably not in 2.8. It needs more discussion with Ray Train. I'd prefer to introduce it in Data first and gather user feedback.
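The extensibility argument above can be sketched as follows (ShuffleOption and resolve_shuffle are hypothetical illustrations of the idea, not APIs added by this PR):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShuffleOption:
    # Hypothetical knobs mentioned in the discussion.
    seed: Optional[int] = None
    granularity: str = "files"

def resolve_shuffle(shuffle):
    """Normalize the shuffle argument a read API might accept."""
    if shuffle is None:
        return None
    if shuffle == "files":
        # Today's string form maps to the default options.
        return ShuffleOption()
    if isinstance(shuffle, ShuffleOption):
        return shuffle
    raise ValueError(f"Unsupported shuffle argument: {shuffle!r}")

# The same parameter accepts both forms, so no new keyword is needed later.
assert resolve_shuffle("files") == ShuffleOption()
assert resolve_shuffle(ShuffleOption(seed=42)).seed == 42
assert resolve_shuffle(None) is None
```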

@c21 (Contributor, Author) commented Oct 12, 2023

All comments should be addressed, PTAL thanks! cc @raulchen, @stephanie-wang and @bveeramani.

@bveeramani (Member) left a comment:

I think we still need to remove file_metadata_shuffler.py (?), but other than that LGTM


@c21 (Contributor, Author) commented Oct 12, 2023

I think we still need to remove file_metadata_shuffler.py (?), but other than that LGTM

Let me do the code removal in a separate PR. Several places need to be cleaned up, such as file_metadata_shuffler.py, https://github.com/ray-project/ray/blob/master/python/ray/data/_default_config.py and https://github.com/ray-project/ray/blob/master/python/ray/data/context.py#L174 .

@bveeramani (Member):

I think we still need to remove file_metadata_shuffler.py (?), but other than that LGTM

Let me do the code removal in a separate PR. Several places need to be cleaned up, such as file_metadata_shuffler.py, https://github.com/ray-project/ray/blob/master/python/ray/data/_default_config.py and https://github.com/ray-project/ray/blob/master/python/ray/data/context.py#L174 .

Ah, gotcha. Sounds good

@stephanie-wang (Contributor):

Should we use shuffle_files=True instead of shuffle="files"?

@stephanie-wang, per our earlier offline discussion, shuffle_files=True has the limitation that it cannot be extended later: if we need to support a shuffle seed or a different shuffle granularity, we would have to add more arguments separately. With a single shuffle argument, we can overload the data type later, e.g. to support a ShuffleOption class via shuffle=ShuffleOption(seed=..., ...)

Should we update the train.DataConfig to set this automatically? (in a separate PR)

That's a good question; probably not in 2.8. It needs more discussion with Ray Train. I'd prefer to introduce it in Data first and gather user feedback.

Sounds good, thanks for the context.

Signed-off-by: Cheng Su <[email protected]>
@c21 c21 merged commit ba6ae3e into ray-project:master Oct 13, 2023
27 of 40 checks passed
@c21 c21 deleted the shuffle branch October 13, 2023 18:08
@c21 c21 mentioned this pull request Oct 13, 2023
@kszlim commented Oct 14, 2023

Curious how this would work with respect to checkpointing and determinism. If you want reproducibility, and to resume without re-iterating over data you've already trained on, how would you ensure that?

c21 added a commit that referenced this pull request Oct 16, 2023
As a follow-up to #40154 (comment), remove the `FileMetadataShuffler` and the config setting in `DataContext`. They are no longer used.

Signed-off-by: Cheng Su <[email protected]>
@c21 (Contributor, Author) commented Oct 16, 2023

Curious how this would work wrt checkpointing and determinism. If you want to have reproducibility and resume without re-iterating on the same data you've trained on, how would you ensure that?

@kszlim - good question. This PR only enables randomness for training; more design discussion is needed to integrate with checkpointing and achieve resumability.
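One way such determinism could be layered on later (purely a sketch of the idea, not Ray's design): derive the file order from a (seed, epoch) pair, so a resumed run can recompute the same order and skip the files it already consumed before the checkpoint.

```python
import random

def epoch_file_order(file_paths, seed, epoch):
    # The same (seed, epoch) always yields the same order, so a
    # restarted run can reconstruct what the interrupted run saw.
    order = list(file_paths)
    random.Random(f"{seed}-{epoch}").shuffle(order)
    return order

def resume_files(file_paths, seed, epoch, files_consumed):
    # Skip the prefix that was already trained on before the checkpoint.
    return epoch_file_order(file_paths, seed, epoch)[files_consumed:]

files = [f"part-{i}.parquet" for i in range(6)]
full = epoch_file_order(files, seed=123, epoch=0)
# Resuming after 2 files continues exactly where the run left off.
assert resume_files(files, seed=123, epoch=0, files_consumed=2) == full[2:]
```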
