[data] set iter_batches default batch_size #26869
Conversation
LGTM, but it would be better for someone from the data team to accept it!
-                blocks as batches. Defaults to a system-chosen batch size.
+            batch_size: Request a specific batch size, or None to use entire blocks
+                as batches (blocks may contain a different number of rows).
+                Defaults to 4096.
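For context, here is a minimal sketch of how the two settings behave (assuming the Ray Datasets API of this period; `ray.data.range` is just a convenient example source):

```python
import ray

ds = ray.data.range(10_000)  # any example datasource works here

# Explicit batch_size: every batch has this many rows
# (except possibly the last one).
for batch in ds.iter_batches(batch_size=4096):
    ...

# batch_size=None: one batch per block, so batch sizes can vary
# with the block layout of the dataset.
for batch in ds.iter_batches(batch_size=None):
    ...
```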
Is this a random pick, or do we have some use cases around this number?
Also, I wonder if changing this default will have an impact on performance tests?
+1. We need to decide on perf vs. stability by default; e.g., for many workloads (like CV training), 256 may be a better default.
Note that users are more likely to come to .iter_batches() with a batch size in mind than to non-inference .map_batches(): the former will be tailored to the width of their data and the memory constraints of their model (which they're likely to have already thought about), while the latter is about making sure that their preprocessing UDF doesn't exhaust worker heap memory (which is more of a Datasets detail that they may not have thought about yet).
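To make that distinction concrete, a hedged sketch of the two call sites (the sizes and the identity UDF are illustrative only):

```python
import ray

ds = ray.data.range(10_000)

# iter_batches(): the caller usually has a batch size in mind,
# tailored to the width of their data and their model's memory budget.
for batch in ds.iter_batches(batch_size=256):
    ...  # e.g. run a training step on the batch

# map_batches(): batch_size mainly bounds how much data one UDF
# invocation holds, so the preprocessing task doesn't exhaust heap memory.
ds = ds.map_batches(lambda batch: batch, batch_size=4096)
```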
My vote would be to start with something at the scale of 256 or 128 by default, or we could match other libraries like Keras and use 32 as the default batch size (though I think that default is a bit outdated, probably geared toward use cases and hardware from 6+ years ago): https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit
+1. How about 256, then? It seems like a good middle-ground number.
Yeah, I can change it to 256. Would this apply to map_batches as well?
No, just iter_batches(). map_batches() should optimize for performance, since it's not connected to SGD.
Yeah, this change should just be for .iter_batches()!
@@ -2320,7 +2321,7 @@ def iter_rows(self, *, prefetch_blocks: int = 0) -> Iterator[Union[T, TableRow]]
             else "native"
         )
         for batch in self.iter_batches(
-            prefetch_blocks=prefetch_blocks, batch_format=batch_format
+            batch_size=None, prefetch_blocks=prefetch_blocks, batch_format=batch_format
Hmm, why not use the default 4096 here? I am kind of worried about the different behavior between iter_rows (read the entire block as one batch) vs iter_batches (read 4096 bytes as one batch) here. Given that the interface of iter_rows just outputs rows, the batch size should not matter to users, right? If a user is worried about performance, they should already be using iter_batches() instead of iter_rows().
@c21 .iter_rows() provides a zero-copy row view of the underlying block, so such batching would add an unnecessary extra slice step that currently results in a full copy of the batch; it's much more efficient to grab the unsliced block and provide the zero-copy row view on that block.
It should also be noted that this batch_size=4096 is a consumer-side slice: we still fetch the entire block regardless of the batch size, so specifying a batch size here would not improve memory stability and would only hurt performance without any benefit.
> It should also be noted that this batch_size=4096 is a consumer-side slice, we still fetch the entire block regardless of the batch size

@clarkzinzow I see, that makes sense to me now. Thanks for the explanation.
python/ray/data/dataset.py (Outdated)
@@ -353,8 +353,9 @@ def map_batches(
             fn: The function to apply to each record batch, or a class type
                 that can be instantiated to create such a callable. Callable classes are
                 only supported for the actor compute strategy.
-            batch_size: Request a specific batch size, or None to use entire
-                blocks as batches. Defaults to a system-chosen batch size.
+            batch_size: Request a specific batch size, or None to use entire blocks
Can we call out that the size is in number of rows? I saw some confusion about 4096 being in number of bytes.
Ah yes, that would be great. For example, pyarrow also calls out the row count.
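For example, the docstring could make the unit explicit; one possible wording (an assumption, not necessarily the text that was merged):

```python
# A possible clarified docstring wording (illustrative only):
#
#     batch_size: The number of rows to include in each batch, or None to
#         use entire blocks as batches. Defaults to 4096.
```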
@matthewdeng Could you rebase on/merge master and make similar default batch size changes for DatasetPipeline.iter_batches?
LGTM!
This reverts commit b048c6f (ray-project#26938).
Resubmitting ray-project#26869: this PR was reverted due to failing tests; however, those failures were actually due to a dependency: ray-project#26950.
Signed-off-by: Matthew Deng <[email protected]>

Why are these changes needed?
Consumers (e.g. Train) may expect generated batches to be of the same size. Prior to this change, the default behavior would be for each batch to be one block, which may be of different sizes.

Changes
- Set default batch_size to 256. This was chosen to be a sensible default for training workloads, and is intentionally different from the existing default batch_size value for Dataset.map_batches.
- Update docs for Dataset.iter_batches, Dataset.map_batches, and DatasetPipeline.iter_batches to be consistent.
- Updated tests and examples to explicitly pass in batch_size=None, as these tests were intentionally testing block iteration and there are other tests that test explicit batch sizes.

Questions
- Optionality: Should batch_size=None be allowed? It's not clear if we want to allow batches to have different sizes. One reason would be to allow for zero-copy reads, but perhaps a separate iter_blocks API can be exposed for this specific use case.
- Default Value: Should there be a default value, and if so, what should it be? Currently it is set to 4096 to match map_batches, but is there a more sensible default? Or should users always specify this themselves?

Checks
- I've run scripts/format.sh to lint the changes in this PR.
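To illustrate the behavior change described above, a small sketch (assuming the Ray Datasets API of this period; the printed batch splits are indicative):

```python
import ray

# A dataset repartitioned into two blocks of roughly 150 rows each.
ds = ray.data.from_items(list(range(300))).repartition(2)

# Old default (equivalent to batch_size=None): one batch per block,
# so batch sizes follow the block layout and can differ.
print([len(batch) for batch in ds.iter_batches(batch_size=None)])  # e.g. [150, 150]

# New default (batch_size=256): fixed-size batches across block
# boundaries; only the final batch may be smaller.
print([len(batch) for batch in ds.iter_batches()])  # e.g. [256, 44]
```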