
[data] set iter_batches default batch_size #26869

Merged Jul 23, 2022 (13 commits)

Conversation

matthewdeng (Contributor) commented Jul 22, 2022

Signed-off-by: Matthew Deng [email protected]

Why are these changes needed?

Consumers (e.g. Train) may expect generated batches to be of the same size. Prior to this change, the default behavior was for each batch to be one block, and blocks may differ in size.

Changes

  1. Set default batch_size to 256 (see the sketch after this list). This was chosen to be a sensible default for training workloads, and is intentionally different from the existing default batch_size value for Dataset.map_batches.
  2. Update docs for Dataset.iter_batches, Dataset.map_batches, and DatasetPipeline.iter_batches to be consistent.
  3. Updated tests and examples to explicitly pass in batch_size=None as these tests were intentionally testing block iteration, and there are other tests that test explicit batch sizes.
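A minimal sketch of the behavior change in items 1 and 3 above (the dataset here is illustrative and not taken from this PR's test code):

```python
import ray

ds = ray.data.range(1000)

# With the new default, iter_batches() yields fixed-size batches
# (256 rows each, except possibly the last) rather than one batch per block.
for batch in ds.iter_batches():
    ...

# Passing batch_size=None keeps the old behavior: each batch is one block,
# so batch sizes may vary. The updated tests pass this explicitly when they
# intend to test block iteration.
for block_batch in ds.iter_batches(batch_size=None):
    ...
```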

Questions

Optionality: Should batch_size=None be allowed? It's not clear if we want to allow batches to have different sizes. One reason would be to allow for zero-copy reads, but perhaps a separate iter_blocks API can be exposed for this specific use case.

Default Value: Should there be a default value, and if so, what should it be? As originally submitted it was set to 4096 to match map_batches (this was later changed to 256 during review; see the discussion below), but is there a more sensible default? Or should users always specify this themselves?
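Regarding the Optionality question above, a hypothetical iter_blocks helper (not an existing Dataset API) could be approximated on top of batch_size=None, since that already yields each block as a single batch:

```python
# Hypothetical sketch only: iter_blocks is not an existing API in this PR.
def iter_blocks(ds):
    # batch_size=None yields one batch per block, supporting zero-copy-style reads.
    yield from ds.iter_batches(batch_size=None)
```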

TODO

  • Check references/docs for cases that use the default value and need to be updated.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl added the @author-action-required label (the PR author is responsible for the next step) Jul 22, 2022
Signed-off-by: Matthew Deng <[email protected]>
@matthewdeng removed the @author-action-required label Jul 22, 2022
Signed-off-by: Matthew Deng <[email protected]>
scv119 (Contributor) commented Jul 22, 2022

LGTM, but better for someone from the data team to accept it!

Signed-off-by: Matthew Deng <[email protected]>
-            batch_size: Request a specific batch size, or None to use entire
-                blocks as batches. Defaults to a system-chosen batch size.
+            batch_size: Request a specific batch size, or None to use entire blocks
+                as batches (blocks may contain different number of rows).
+                Defaults to 4096.
Contributor:
Is this a random pick, or do we have some use cases around this number?

Also, I wonder if changing this default will have an impact on performance tests?

Contributor:

+1, we need to decide on perf vs stability by default, e.g. for many workloads (like CV training), 256 may be a better default.

Note that users are more likely to come to .iter_batches() with a batch size in mind than for non-inference .map_batches(), since the former will be tailored to the width of their data and the memory constraints of their model (which they're likely to have already thought about), while the latter is about making sure that their preprocessing UDF doesn't exhaust worker heap memory (which is more of a Datasets detail that they maybe haven't thought about yet).
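An illustrative sketch of that distinction (preprocess is a placeholder UDF, and the training step is elided):

```python
import ray

def preprocess(batch):
    # Placeholder preprocessing UDF; map_batches sizing here mainly guards
    # against exhausting worker heap memory.
    return batch

ds = ray.data.range(10_000).map_batches(preprocess)  # map_batches keeps its own default

# Training consumers typically arrive with a batch size already in mind,
# tailored to their data width and model memory constraints.
for batch in ds.iter_batches(batch_size=256):
    pass  # a training step would consume the batch here
```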

clarkzinzow (Contributor) commented Jul 22, 2022:

My vote would be starting with something at the scale of 256 or 128 by default, or we could match other libraries like Keras and use 32 as the default batch size (I think this default batch size is a bit outdated, though, probably geared towards use cases and hardware from 6+ years ago): https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit

Contributor:

+1. How about 256 then? It seems like a good middle ground number.

matthewdeng (Contributor, Author):

Yeah I can change it to 256 - would this apply to map_batches as well?

Contributor:

No, just iter_batches(). map_batches() should optimize for performance since it's not connected to SGD.

Contributor:

Yeah this change should just be for .iter_batches()!

@@ -2320,7 +2321,7 @@ def iter_rows(self, *, prefetch_blocks: int = 0) -> Iterator[Union[T, TableRow]]
             else "native"
         )
         for batch in self.iter_batches(
-            prefetch_blocks=prefetch_blocks, batch_format=batch_format
+            batch_size=None, prefetch_blocks=prefetch_blocks, batch_format=batch_format
Contributor:

Hmm, why not use the default 4096 here?

I am a bit worried about the different behavior between iter_rows (reads an entire block as one batch) and iter_batches (reads 4096 bytes as one batch) here. That said, since the iter_rows interface just outputs rows, the batch size shouldn't matter to users, right?

If users are worried about performance, they should already be using iter_batches() instead of iter_rows().

clarkzinzow (Contributor):

@c21 .iter_rows() provides a zero-copy row view of the underlying block. Batching here would add an unnecessary extra slice step that currently results in a full copy of each batch, so it's much more efficient to grab the unsliced block and provide the zero-copy row view on that block.

It should also be noted that this batch_size=4096 is a consumer-side slice: we still fetch the entire block regardless of the batch size, so specifying a batch size here would not improve memory stability and would only hurt performance without any benefit.
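A conceptual sketch of the consumer-side slicing point above (illustrative only, not Ray's actual implementation):

```python
def batches_from_block(block_rows, batch_size=None):
    # The full block has already been fetched by the time we get here,
    # so a batch_size cannot reduce memory usage.
    if batch_size is None:
        # Whole block as one batch; rows can then be viewed without slicing.
        yield block_rows
        return
    for start in range(0, len(block_rows), batch_size):
        # Each slice is an extra step (and, per the comment above, currently a
        # full copy) with no memory-stability benefit for iter_rows().
        yield block_rows[start:start + batch_size]
```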

c21 (Contributor):

> It should also be noted that this batch_size=4096 is a consumer-side slice, we still fetch the entire block regardless of the batch size

@clarkzinzow - I see. That makes sense to me now. Thanks for the explanation.

@@ -353,8 +353,9 @@ def map_batches(
             fn: The function to apply to each record batch, or a class type
                 that can be instantiated to create such a callable. Callable classes are
                 only supported for the actor compute strategy.
-            batch_size: Request a specific batch size, or None to use entire
-                blocks as batches. Defaults to a system-chosen batch size.
+            batch_size: Request a specific batch size, or None to use entire blocks
Contributor:

Can we call out that the size is a number of rows? I saw some confusion about 4096 being a number of bytes.

Contributor:

Ah yes, that would be great. For example, pyarrow also calls out the row count.
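One possible wording for that docstring clarification (a sketch only, not necessarily the final text merged in this PR):

```python
# batch_size: Request a specific batch size (number of rows per batch),
#     or None to use entire blocks as batches (blocks may contain a
#     different number of rows). Defaults to 4096 for map_batches.
```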

@ericl added the @author-action-required label Jul 22, 2022
clarkzinzow (Contributor) commented:
@matthewdeng Could you rebase on/merge master and make similar default batch size changes for .iter_torch_batches() and .iter_tf_batches()?
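A short sketch of the analogous calls for the framework-specific iterators named above (assuming the same 256-row default is applied to them):

```python
import ray

ds = ray.data.range(1000)

# Framework-specific iteration with an explicit batch size; the request above
# is to give these iterators the same 256-row default as iter_batches().
for torch_batch in ds.iter_torch_batches(batch_size=256):
    ...

for tf_batch in ds.iter_tf_batches(batch_size=256):
    ...
```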

@clarkzinzow (Contributor) left a review comment:
LGTM!

matthewdeng added a commit to matthewdeng/ray that referenced this pull request Jul 24, 2022
scv119 pushed a commit that referenced this pull request Jul 24, 2022
@matthewdeng deleted the default-batch-size branch July 25, 2022 04:03
scv119 pushed a commit that referenced this pull request Jul 25, 2022
Why are these changes needed?
Resubmitting #26869.

This PR was reverted due to failing tests; however, those failures were actually due to a dependency: #26950
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Labels: tests-ok (The tagger certifies test failures are unrelated and assumes personal liability.)

6 participants