[data] set iter_batches default batch_size #26869
Conversation
LGTM, but it would be better for someone from the data team to accept it!
-                blocks as batches. Defaults to a system-chosen batch size.
+            batch_size: Request a specific batch size, or None to use entire blocks
+                as batches (blocks may contain a different number of rows).
+                Defaults to 4096.
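For context, here is a minimal sketch of how the two settings behave (assuming the Ray Datasets API of this period; `ray.data.range` is just a convenient example source):

```python
import ray

ds = ray.data.range(10_000)  # any example datasource works here

# Explicit batch_size: every batch has this many rows
# (except possibly the last one).
for batch in ds.iter_batches(batch_size=4096):
    ...

# batch_size=None: one batch per block, so batch sizes can vary
# with the block layout of the dataset.
for batch in ds.iter_batches(batch_size=None):
    ...
```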
Is this a random pick, or do we have some use cases around this number?
Also, I wonder if changing this default will have an impact on performance tests?
+1. We need to decide on perf vs. stability by default; e.g., for many workloads (like CV training), 256 may be a better default.
Note that users are more likely to come to .iter_batches() with a batch size in mind than to non-inference .map_batches(): the former will be tailored to the width of their data and the memory constraints of their model (which they're likely to have already thought about), while the latter is about making sure that their preprocessing UDF doesn't exhaust worker heap memory (which is more of a Datasets detail that they may not have thought about yet).
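To make that distinction concrete, a hedged sketch of the two call sites (the sizes and the identity UDF are illustrative only):

```python
import ray

ds = ray.data.range(10_000)

# iter_batches(): the caller usually has a batch size in mind,
# tailored to the width of their data and their model's memory budget.
for batch in ds.iter_batches(batch_size=256):
    ...  # e.g. run a training step on the batch

# map_batches(): batch_size mainly bounds how much data one UDF
# invocation holds, so the preprocessing task doesn't exhaust heap memory.
ds = ds.map_batches(lambda batch: batch, batch_size=4096)
```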
My vote would be to start with something at the scale of 256 or 128 by default, or we could match other libraries like Keras and use 32 as the default batch size (though I think that default is a bit outdated, probably geared toward use cases and hardware from 6+ years ago): https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit
+1. How about 256, then? It seems like a good middle-ground number.
Yeah, I can change it to 256. Would this apply to map_batches as well?
No, just iter_batches(). map_batches() should optimize for performance, since it's not connected to SGD.
Yeah, this change should just be for .iter_batches()!
@@ -2320,7 +2321,7 @@ def iter_rows(self, *, prefetch_blocks: int = 0) -> Iterator[Union[T, TableRow]]
             else "native"
         )
         for batch in self.iter_batches(
-            prefetch_blocks=prefetch_blocks, batch_format=batch_format
+            batch_size=None, prefetch_blocks=prefetch_blocks, batch_format=batch_format
Hmm, why not use the default 4096 here? I am kind of worried about the different behavior between iter_rows (read the entire block as one batch) vs iter_batches (read 4096 bytes as one batch) here. Given that the interface of iter_rows just outputs rows, the batch size should not matter to users, right? If a user is worried about performance, they should already be using iter_batches() instead of iter_rows().
@c21 .iter_rows() provides a zero-copy row view of the underlying block, so such batching would add an unnecessary extra slice step that currently results in a full copy of the batch; it's much more efficient to grab the unsliced block and provide the zero-copy row view on that block.
It should also be noted that this batch_size=4096 is a consumer-side slice: we still fetch the entire block regardless of the batch size, so specifying a batch size here would not improve memory stability and would only hurt performance without any benefit.
> It should also be noted that this batch_size=4096 is a consumer-side slice, we still fetch the entire block regardless of the batch size

@clarkzinzow I see, that makes sense to me now. Thanks for the explanation.
python/ray/data/dataset.py (Outdated)
@@ -353,8 +353,9 @@ def map_batches(
             fn: The function to apply to each record batch, or a class type
                 that can be instantiated to create such a callable. Callable classes are
                 only supported for the actor compute strategy.
-            batch_size: Request a specific batch size, or None to use entire
-                blocks as batches. Defaults to a system-chosen batch size.
+            batch_size: Request a specific batch size, or None to use entire blocks
Can we call out that the size is in number of rows? I saw some confusion about 4096 being in number of bytes.
Ah yes, that would be great. For example, pyarrow also calls out the row count.
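For example, the docstring could make the unit explicit; one possible wording (an assumption, not necessarily the text that was merged):

```python
# A possible clarified docstring wording (illustrative only):
#
#     batch_size: The number of rows to include in each batch, or None to
#         use entire blocks as batches. Defaults to 4096.
```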
@matthewdeng Could you rebase on/merge master and make similar default batch size changes for DatasetPipeline.iter_batches?
LGTM!
This reverts commit b048c6f (ray-project#26938).
Resubmitting ray-project#26869: this PR was reverted due to failing tests; however, those failures were actually due to a dependency: ray-project#26950.
Signed-off-by: Matthew Deng <[email protected]>

Why are these changes needed?
Consumers (e.g. Train) may expect generated batches to be of the same size. Prior to this change, the default behavior would be for each batch to be one block, which may be of different sizes.

Changes
- Set default batch_size to 256. This was chosen to be a sensible default for training workloads, and is intentionally different from the existing default batch_size value for Dataset.map_batches.
- Update docs for Dataset.iter_batches, Dataset.map_batches, and DatasetPipeline.iter_batches to be consistent.
- Updated tests and examples to explicitly pass in batch_size=None, as these tests were intentionally testing block iteration and there are other tests that test explicit batch sizes.

Questions
- Optionality: Should batch_size=None be allowed? It's not clear if we want to allow batches to have different sizes. One reason would be to allow for zero-copy reads, but perhaps a separate iter_blocks API can be exposed for this specific use case.
- Default Value: Should there be a default value, and if so, what should it be? Currently it is set to 4096 to match map_batches, but is there a more sensible default? Or should users always specify this themselves?

Checks
- I've run scripts/format.sh to lint the changes in this PR.
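To illustrate the behavior change described above, a small sketch (assuming the Ray Datasets API of this period; the printed batch splits are indicative):

```python
import ray

# A dataset repartitioned into two blocks of roughly 150 rows each.
ds = ray.data.from_items(list(range(300))).repartition(2)

# Old default (equivalent to batch_size=None): one batch per block,
# so batch sizes follow the block layout and can differ.
print([len(batch) for batch in ds.iter_batches(batch_size=None)])  # e.g. [150, 150]

# New default (batch_size=256): fixed-size batches across block
# boundaries; only the final batch may be smaller.
print([len(batch) for batch in ds.iter_batches()])  # e.g. [256, 44]
```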