[data] Make sure the tf and tensor iteration work in dataset pipeline #34248

jianoaix · 2023-04-10T23:40:29Z

Why are these changes needed?

With the new dataset iterator API, TensorFlow and Torch iteration from DatasetPipeline is broken.

See: #33994

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(


jianoaix · 2023-04-11T02:41:43Z

python/ray/data/tests/test_dataset_iterator.py

 it = ds.iterator()
- for _ in range(2):


Because this keeps using DatasetPipeline.iter_batches(), the consumption is not repeatable.

This is changing the actual behavior of PipelineDatasetIterator.

zhe-thoughts

Approved for picking into 2.4 branch


jianoaix · 2023-04-11T20:05:07Z

python/ray/data/tests/test_dataset_iterator.py

@@ -99,13 +95,13 @@ def test_tf_e2e_pipeline(ray_start_regular_shared):
 ds = ray.data.range_table(5).repeat(2)
 it = ds.iterator()
 model = build_model()
- model.fit(it.to_tf("value", "value"), epochs=2)
+ model.fit(it.to_tf("value", "value"), epochs=1)


cc @amogkam

amogkam

Looks like this is changing the actual behavior of PipelineDatasetIterator which I don't think is what we want.

The behavior difference in DatasetPipeline and PipelineDatasetIterator was intentional. For PipelineDatasetIterator, we want each call to iter to only return a single epoch's worth of data, not the entire repeated dataset.
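The distinction can be sketched with a toy model. Everything here (`ToyPipeline`, `ToyPipelineIterator`, `iter_epochs`) is a hypothetical stand-in written in plain Python, not Ray's implementation; it only illustrates "one pass over all repeats" versus "one epoch per call".

```python
from typing import Iterator, List


class ToyPipeline:
    """Hypothetical stand-in for a repeated dataset: a fixed list of epochs."""

    def __init__(self, rows: List[int], repeats: int) -> None:
        self._epochs = [list(rows) for _ in range(repeats)]

    def iter_epochs(self) -> Iterator[List[int]]:
        yield from self._epochs

    def iter_batches(self) -> Iterator[int]:
        # A single pass walks every epoch back to back.
        for epoch in self._epochs:
            yield from epoch


class ToyPipelineIterator:
    """Hypothetical wrapper: each iter_batches() call consumes only the next epoch."""

    def __init__(self, pipeline: ToyPipeline) -> None:
        self._epoch_iter = pipeline.iter_epochs()

    def iter_batches(self) -> Iterator[int]:
        # Use a default so an exhausted pipeline ends the generator cleanly.
        epoch = next(self._epoch_iter, None)
        if epoch is None:
            return
        yield from epoch


pipe_rows = list(ToyPipeline([0, 1, 2], repeats=2).iter_batches())
it = ToyPipelineIterator(ToyPipeline([0, 1, 2], repeats=2))
first_epoch = list(it.iter_batches())
second_epoch = list(it.iter_batches())
```

Under this model, a two-epoch loop over the iterator sees the data twice, while a single pass over the pipeline already covers both repeats, which is consistent with the test above moving from epochs=2 to epochs=1 once it goes through the pipeline path.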

amogkam · 2023-04-11T20:09:15Z

python/ray/data/_internal/dataset_iterator/pipelined_dataset_iterator.py

- # Set prefetch_batches to default of 0 for DatasetPipeline.
- return super().iter_batches(
- prefetch_batches=prefetch_batches,
+ yield from self._base_dataset_pipeline.iter_batches(


Isn't this changing the behavior of PipelineDatasetIterator? Not using self._base_dataset_pipeline was intentional, so that each call to iter iterates over the next epoch.

amogkam · 2023-04-11T20:14:39Z

The fix should just be to not directly call Dataset.iter_torch_batches, Dataset.to_tf, and Dataset.to_torch in dataset_pipeline.py

jianoaix · 2023-04-11T20:31:56Z

> The fix should just be to not directly call Dataset.iter_torch_batches, Dataset.to_tf, and Dataset.to_torch in dataset_pipeline.py

This doesn't seem fundamentally different from just changing the PipelinedDatasetIterator, since this iterator is only used by the tf/torch APIs called from DatasetPipeline, right?

amogkam · 2023-04-11T21:35:07Z

Ah no, PipelinedDatasetIterator is used for the Ray Train integration.

amogkam

nice, thanks!

jianoaix · 2023-04-11T22:42:21Z

python/ray/data/dataset_iterator.py

@@ -31,6 +31,13 @@
 from ray.data.dataset import TensorFlowTensorBatchType


+def _is_tensor_dataset(schema) -> bool:


No-op change, just to make to_tf shareable with DatasetPipeline.
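As a rough illustration of why hoisting a helper to module level makes a method shareable: in the sketch below, `describe` depends only on `iter_batches()`, `schema()`, and a module-level function, so a second class can borrow it without inheriting from the first. The classes, the `describe` method, and the body of `_is_tensor_dataset` are all assumptions made up for this example; the PR only shows the helper's signature.

```python
from typing import Iterator, List


def _is_tensor_dataset(schema: List[str]) -> bool:
    # Hypothetical rule for this sketch only: one reserved column marks tensor data.
    return schema == ["__value__"]


class ToyIterator:
    """Hypothetical iterator over a single dataset."""

    def __init__(self, rows: List[int], schema: List[str]) -> None:
        self._rows = rows
        self._schema = schema

    def iter_batches(self) -> Iterator[int]:
        yield from self._rows

    def schema(self) -> List[str]:
        return self._schema

    def describe(self) -> str:
        # Uses only iter_batches(), schema(), and the module-level helper,
        # so any object offering those two methods can be passed as self.
        kind = "tensor" if _is_tensor_dataset(self.schema()) else "tabular"
        return f"{kind}:{list(self.iter_batches())}"


class ToyPipeline:
    """Hypothetical pipeline that reuses ToyIterator.describe without subclassing."""

    def __init__(self, epochs: List[List[int]], schema: List[str]) -> None:
        self._epochs = epochs
        self._schema = schema

    def iter_batches(self) -> Iterator[int]:
        for epoch in self._epochs:
            yield from epoch

    def schema(self) -> List[str]:
        return self._schema

    def describe(self) -> str:
        # Borrow the shared implementation; self is the pipeline here.
        return ToyIterator.describe(self)


iterator_summary = ToyIterator([1, 2], ["__value__"]).describe()
pipeline_summary = ToyPipeline([[1], [2]], ["a", "b"]).describe()
```

Had the helper been an instance method on `ToyIterator`, the borrowed call would fail on a `ToyPipeline` instance, which is one plausible reason for making it a plain function.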


jianoaix · 2023-04-12T00:51:00Z

Tests passed (the failure is not related to this change).

…ray-project#34248)
* Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)" This reverts commit 5c79954.
* make sure tf and tensor iteration in datapipeline work
* Fix
* fix
* fix
* fix
* feedback
* feedback
* fix



jianoaix added 17 commits March 22, 2023 20:33

Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"

925a247

This reverts commit 5c79954.

Merge branch 'master' of https://github.com/ray-project/ray

b33ae23

Merge branch 'master' of https://github.com/ray-project/ray

4ef5d35

Merge branch 'master' of https://github.com/ray-project/ray

e6dcd6e

Merge branch 'master' of https://github.com/ray-project/ray

482e9dc

Merge branch 'master' of https://github.com/ray-project/ray

3e2d393

Merge branch 'master' of https://github.com/ray-project/ray

cb0840c

Merge branch 'master' of https://github.com/ray-project/ray

476e4f5

Merge branch 'master' of https://github.com/ray-project/ray

c0fe5a3

Merge branch 'master' of https://github.com/ray-project/ray

fb82ed2

Merge branch 'master' of https://github.com/ray-project/ray

aaac4b4

Merge branch 'master' of https://github.com/ray-project/ray

7529765

Merge branch 'master' of https://github.com/ray-project/ray

31415df

Merge branch 'master' of https://github.com/ray-project/ray

b207c74

Merge branch 'master' of https://github.com/ray-project/ray

1ba2e7d

make sure tf and tensor iteration in datapipeline work

195d76f

Fix

706875e

jianoaix requested review from ericl, scv119, clarkzinzow, jjyao and c21 as code owners April 10, 2023 23:40

jianoaix assigned ericl and c21 Apr 10, 2023

ericl approved these changes Apr 11, 2023


Merge branch 'master' of https://github.com/ray-project/ray into pipelineitertorchbatches

a059dee

jianoaix assigned zhe-thoughts Apr 11, 2023

jianoaix added the @author-action-required label Apr 11, 2023

jianoaix added 2 commits April 11, 2023 02:39

fix

3ad60e1

Merge branch 'master' of https://github.com/ray-project/ray into pipelineitertorchbatches

67a74b0

jianoaix commented Apr 11, 2023


zhe-thoughts approved these changes Apr 11, 2023


jianoaix added 3 commits April 11, 2023 16:52

fix

b855e89

Merge branch 'master' of https://github.com/ray-project/ray into pipelineitertorchbatches

1a4d003

fix

ffa90ae

jianoaix commented Apr 11, 2023


amogkam requested changes Apr 11, 2023


jianoaix added 2 commits April 11, 2023 22:20

feedback

e311749

feedback

33ece2f

amogkam approved these changes Apr 11, 2023


jianoaix commented Apr 11, 2023


jianoaix added 2 commits April 11, 2023 23:08

Merge branch 'master' of https://github.com/ray-project/ray into pipelineitertorchbatches

168ac0b

fix

02b20ac

jianoaix removed the @author-action-required label Apr 12, 2023

jianoaix merged commit 66d3aaf into ray-project:master Apr 12, 2023

clarng pushed a commit that referenced this pull request Apr 12, 2023

[Cherrypick 2.4][data] Make sure the tf and tensor iteration work in dataset pipeline (#34296)

6b12b9d

* [data] Make sure the tf and tensor iteration work in dataset pipeline (#34248)
