Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] Make sure the tf and tensor iteration work in dataset pipeline #34248

Merged
merged 27 commits into from
Apr 12, 2023

Conversation

jianoaix
Copy link
Contributor

@jianoaix jianoaix commented Apr 10, 2023

Why are these changes needed?

With the new dataset iterator API, the iteration of tf and torch from DatasetPipeline is broken.

See: #33994

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jianoaix jianoaix added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 11, 2023
it = ds.iterator()
for _ in range(2):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By keeping using the DatasetPipeline.iter_batches(), the consumption is not repeatable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is changing the actual behavior of PipelineDatasetIterator

Copy link
Collaborator

@zhe-thoughts zhe-thoughts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved for picking into 2.4 branch

@@ -99,13 +95,13 @@ def test_tf_e2e_pipeline(ray_start_regular_shared):
ds = ray.data.range_table(5).repeat(2)
it = ds.iterator()
model = build_model()
model.fit(it.to_tf("value", "value"), epochs=2)
model.fit(it.to_tf("value", "value"), epochs=1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is changing the actual behavior of PipelineDatasetIterator which I don't think is what we want.

The behavior difference in DatasetPipeline and PipelineDatasetIterator was intentional. For PipelineDatasetIterator, we want each call to iter to only return a single epoch's worth of data, not the entire repeated dataset.

# Set prefetch_batches to default of 0 for DatasetPipeline.
return super().iter_batches(
prefetch_batches=prefetch_batches,
yield from self._base_dataset_pipeline.iter_batches(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this changing the behavior of PipelineDatasetIterator? Not using self._base_dataset_pipeline was intentional so that each call to iter would iterate over the next epoch.

@amogkam
Copy link
Contributor

amogkam commented Apr 11, 2023

The fix should just be to not directly call Dataset.iter_torch_batches, Dataset.to_tf, and Dataset.to_torch in dataset_pipeline.py

@jianoaix
Copy link
Contributor Author

The fix should just be to not directly call Dataset.iter_torch_batches, Dataset.to_tf, and Dataset.to_torch in dataset_pipeline.py

It doesn't seem this has fundamental difference than just changing the PipelinedDatasetIterator since this iterator is only used by those tf/torch APIs called from DatasetPipeline, right?

@amogkam
Copy link
Contributor

amogkam commented Apr 11, 2023

Ah no, PipelinedDatasetIterator is used for the Ray Train integration.

Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks!

@@ -31,6 +31,13 @@
from ray.data.dataset import TensorFlowTensorBatchType


def _is_tensor_dataset(schema) -> bool:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No-op change, just to make the to_tf sharable to DatasetPipeline.

@jianoaix
Copy link
Contributor Author

Tests passed (failure not relevant).

@jianoaix jianoaix removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 12, 2023
@jianoaix jianoaix merged commit 66d3aaf into ray-project:master Apr 12, 2023
jianoaix added a commit to jianoaix/ray that referenced this pull request Apr 12, 2023
…ray-project#34248)

* Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"

This reverts commit 5c79954.

* make sure tf and tensor iteration in datapipeline work

* Fix

* fix

* fix

* fix

* feedback

* feedback

* fix
clarng pushed a commit that referenced this pull request Apr 12, 2023
…dataset pipeline (#34296)


* [data] Make sure the tf and tensor iteration work in dataset pipeline (#34248)
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
…ray-project#34248)

* Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"

This reverts commit 5c79954.

* make sure tf and tensor iteration in datapipeline work

* Fix

* fix

* fix

* fix

* feedback

* feedback

* fix

Signed-off-by: elliottower <[email protected]>
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
…ray-project#34248)

* Revert "[Datasets] Revert "Enable streaming executor by default (ray-project#32493)" (ray-project#33485)"

This reverts commit 5c79954.

* make sure tf and tensor iteration in datapipeline work

* Fix

* fix

* fix

* fix

* feedback

* feedback

* fix

Signed-off-by: Jack He <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants