[data] Add iterator batch_format=None support, which will yield batches in the current batch format with zero copies #33562

ericl · 2023-03-21T23:29:20Z

Why are these changes needed?

This PR is a cleanup of #33536.

It uses "None" instead of "zero-copy" as a batch format. None already has a similar meaning for batch_size, where it means a system-chosen batch size; here, "None" likewise means the system-chosen optimal batch format.
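To illustrate the intended semantics, here is a minimal sketch that uses only the standard library. It is not Ray's implementation: `Block`, `CONVERTERS`, `to_rows`, and `format_batch` are hypothetical names, and a plain dict of lists stands in for a real block. The point is only that `batch_format=None` hands back the underlying object untouched, while a named format goes through a conversion that builds new objects.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in for a block: column name -> list of values.
Block = Dict[str, List[Any]]


def to_rows(block: Block) -> List[Dict[str, Any]]:
    """Convert a columnar block into a list of row dicts (allocates new objects)."""
    names = list(block)
    return [dict(zip(names, values)) for values in zip(*block.values())]


# Hypothetical registry of converters, keyed by batch format name.
CONVERTERS: Dict[str, Callable[[Block], Any]] = {
    "columns": lambda block: {name: list(col) for name, col in block.items()},
    "rows": to_rows,
}


def format_batch(block: Block, batch_format: Optional[str]) -> Any:
    """Return the block in the requested format.

    None means "whatever format the block is already in": the same object
    is returned, so no copy or conversion takes place.
    """
    if batch_format is None:
        return block
    if batch_format not in CONVERTERS:
        raise ValueError(f"Unknown batch_format: {batch_format!r}")
    return CONVERTERS[batch_format](block)
```

With this sketch, `format_batch(block, None) is block` holds, whereas `format_batch(block, "columns")` returns an equal but distinct object.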

Signed-off-by: Eric Liang <[email protected]>

amogkam · 2023-03-21T23:32:11Z

python/ray/data/dataset.py

@@ -379,7 +379,7 @@ def map_batches(
 *,
 batch_size: Optional[Union[int, Literal["default"]]] = "default",
 compute: Optional[Union[str, ComputeStrategy]] = None,
- batch_format: Literal["default", "pandas", "pyarrow", "numpy"] = "default",
+ batch_format: Optional[str] = "default",


Keep it as Optional[Literal] to make the supported batch formats fully explicit?

I feel like that is hard to maintain (per the inconsistencies already in the code), so I opted to unify on the shorter signature.
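For context on the trade-off discussed here, one possible middle ground is to define the `Literal` once and derive the runtime check from it, so the annotation and the validation cannot drift apart. This is only a sketch, not what the PR does: `BatchFormat`, `VALID_BATCH_FORMATS`, and `check_batch_format` are hypothetical names, and the list of formats is copied from the removed line in the diff above.

```python
from typing import Literal, Optional, get_args

# Single source of truth for the named formats (taken from the old signature).
BatchFormat = Literal["default", "pandas", "pyarrow", "numpy"]

# Derive the runtime set from the type alias instead of repeating the strings.
VALID_BATCH_FORMATS = frozenset(get_args(BatchFormat))


def check_batch_format(batch_format: Optional[str]) -> Optional[str]:
    """Accept None (no formatting) or one of the named formats; reject the rest."""
    if batch_format is not None and batch_format not in VALID_BATCH_FORMATS:
        raise ValueError(
            f"batch_format must be None or one of {sorted(VALID_BATCH_FORMATS)}, "
            f"got {batch_format!r}"
        )
    return batch_format
```

A signature could then be annotated as `Optional[BatchFormat]` while still validating at runtime. Whether that is worth the extra indirection compared with a plain `Optional[str]` is a judgment call.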

amogkam · 2023-03-21T23:34:09Z

Let's also keep the documentation changes from https://github.com/ray-project/ray/pull/33536/files#diff-988f3832ac94d085daf61260175e2580920ebd1521dc760f58b426b94379d5b7L235?


ericl · 2023-03-22T02:14:07Z

> Let's also keep the documentation changes from https://github.com/ray-project/ray/pull/33536/files#diff-988f3832ac94d085daf61260175e2580920ebd1521dc760f58b426b94379d5b7L235?

Done

clarkzinzow · 2023-03-22T16:05:29Z

python/ray/data/dataset.py

@@ -540,7 +540,9 @@ def map_batches(
 (promotes tables to Pandas and tensors to NumPy), ``"pandas"`` to select
 ``pandas.DataFrame``, "pyarrow" to select ``pyarrow.Table``, or
 ``"numpy"`` to select ``numpy.ndarray`` for tensor datasets and
- ``Dict[str, numpy.ndarray]`` for tabular datasets. Default is "default".
+ ``Dict[str, numpy.ndarray]`` for tabular datasets, or None to return
+ the underlying block exactly as is with no additional formatting.


Nice, I like batch_format=None a good bit more than adding another literal string!

#33601) The failure in rllib should have been fixed by #33562. Verified with `python -m pytest rllib/core/learner/torch/tests/test_torch_learner.py::TestLearner::test_end_to_end_update`.


ericl added 6 commits March 21, 2023 14:54

wip

5d15807

wip

43ba159

Signed-off-by: Eric Liang <[email protected]>

wip

26b8d92

Signed-off-by: Eric Liang <[email protected]>

update

79cd16c

Signed-off-by: Eric Liang <[email protected]>

update

d876c83

Signed-off-by: Eric Liang <[email protected]>

fix

b310e35

Signed-off-by: Eric Liang <[email protected]>

ericl requested review from scv119, clarkzinzow, jjyao, jianoaix and c21 as code owners March 21, 2023 23:29

ericl assigned c21, amogkam and jianoaix Mar 21, 2023

amogkam approved these changes Mar 21, 2023


ericl mentioned this pull request Mar 22, 2023

[Data] Deprecate dataset_format #33437

Merged

8 tasks

add zero copy docs

4e6aec4

Signed-off-by: Eric Liang <[email protected]>

ericl requested review from maxpumperla and a team as code owners March 22, 2023 02:14

clarkzinzow approved these changes Mar 22, 2023


ericl merged commit 68afa43 into ray-project:master Mar 22, 2023

jianoaix mentioned this pull request Mar 22, 2023

Revert "[Datasets] Revert "Enable streaming executor by default (#324… #33601

Merged

8 tasks

ericl mentioned this pull request Mar 25, 2023

[Datasets] Add a new "zero_copy" batch format #32662

Closed
