[Data] Add benchmark for Ray Data + Trainer #37624
Conversation
Signed-off-by: Scott Lee <[email protected]>
Successful release test run:
```python
)
parser.add_argument(
    "--num-workers",
    default=2,
```
should we test on a larger scale as well?
I think the default value can be small, to ease local testing. But we should use a larger arg in release_tests.yaml.
```python
# TorchTrainer.fit() (step 3 above)


def iterate(dataset, label, metrics):
```
let's reuse the same method in https://github.com/ray-project/ray/blob/master/release/nightly_tests/dataset/image_loader_microbenchmark.py#L17-L27; we can put the method into a util.py.
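For reference, a minimal sketch of that shared helper, approximating the linked method (exact details may differ from image_loader_microbenchmark.py):

```python
import time


def iterate(dataset, label, metrics):
    # Consume the dataset once, timing the full pass.
    start = time.time()
    num_rows = 0
    for batch in dataset:
        # Assumes each batch exposes its row count via len().
        num_rows += len(batch)
    end = time.time()
    # Record rows/sec under the given label, e.g. "ray.data+transform".
    tput = num_rows / (end - start)
    print(label, "tput:", tput)
    metrics[label] = tput
```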
```python
    metrics[label] = tput


def get_transform(to_torch_tensor):
```
ditto.
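For context, a sketch of what that shared helper could look like, using standard torchvision transforms (the crop size and scale here are illustrative, not taken from the PR):

```python
import torchvision


def get_transform(to_torch_tensor):
    # Build a random-crop-and-flip preprocessing pipeline; optionally
    # convert the input to a torch tensor first.
    transform = torchvision.transforms.Compose(
        ([torchvision.transforms.ToTensor()] if to_torch_tensor else [])
        + [
            torchvision.transforms.RandomResizedCrop(
                size=224,
                scale=(0.05, 1.0),
            ),
            torchvision.transforms.RandomHorizontalFlip(),
        ]
    )
    return transform
```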
```python
    return transform


def crop_and_flip_image_batch(image_batch):
```
ditto
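And a sketch of the batch-level wrapper, reusing the get_transform above and assuming batches arrive as dicts of NHWC numpy arrays under an "image" key (the key and layout are assumptions):

```python
import torch


def crop_and_flip_image_batch(image_batch):
    # torchvision transforms expect channels-first (NCHW) tensors, so
    # permute the NHWC numpy batch before applying the transform.
    transform = get_transform(to_torch_tensor=False)
    images = torch.as_tensor(image_batch["image"]).permute(0, 3, 1, 2)
    image_batch["image"] = transform(images)
    return image_batch
```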
```python
    .map_batches(crop_and_flip_image_batch)
)
# Iterate over the dataset.
for i in range(args.num_epochs):
```
This is probably not useful. In this benchmark, I think we want to make the usage as close to the real training workloads as possible.
I think we included this to compare throughput between the data ingestion and the training phases. If we don't need benchmarking for this part, I can just remove this iterate() call.
I think what we currently measure in the training loop function is already data ingestion throughput (because we don't apply a real model).
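To illustrate the point, a minimal sketch of such a training-loop measurement, using the same ray.air.session APIs the benchmark's diff already shows (the epoch count and the "image" key are assumptions; note that no model is applied):

```python
import time

from ray.air import session


def train_loop_per_worker():
    # Each worker consumes its dataset shard; since no model is applied,
    # the reported throughput is effectively data-ingestion throughput.
    shard = session.get_dataset_shard("train")
    for i in range(2):  # num_epochs, assumed
        start_t = time.time()
        num_rows = 0
        for batch in shard.iter_torch_batches(batch_size=32):
            num_rows += batch["image"].shape[0]  # no forward/backward pass
        end_t = time.time()
        epoch_tput = num_rows / (end_t - start_t)
        session.report({"tput": epoch_tput, "epoch": i})
```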
```python
end_t = time.time()
# Record throughput per epoch.
epoch_tput = num_rows / (end_t - start_t)
session.report({"tput": epoch_tput, "epoch": i})
```
also report the throughput in terms of size?
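A sketch of one way to do that, accumulating tensor sizes inside the epoch loop so bytes/sec can be reported alongside rows/sec (the helper name and the "image" key are hypothetical):

```python
import time


def report_epoch_throughput(batches, epoch, report_fn):
    # `batches` is an iterable of dicts of torch tensors (e.g. from
    # iter_torch_batches); `report_fn` stands in for session.report.
    start_t = time.time()
    num_rows = 0
    num_bytes = 0
    for batch in batches:
        tensor = batch["image"]
        num_rows += tensor.shape[0]
        num_bytes += tensor.element_size() * tensor.nelement()
    elapsed = time.time() - start_t
    report_fn(
        {
            "tput": num_rows / elapsed,
            "tput_bytes": num_bytes / elapsed,
            "epoch": epoch,
        }
    )
```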
Also, what do we think about the current default batch size being 32? @c21 @raulchen @stephanie-wang
```python
)
parser.add_argument(
    "--batch-size",
    default=32,
```
what does GPU utilization look like on the dashboard?
I think the principle is that we want to maximize this number as long as the data can fit into GPU memory. Could you measure the size of each row? Then we can figure out a proper number.
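One quick way to estimate per-row size with existing ray.data.Dataset APIs (the S3 path here is illustrative, not the benchmark's actual dataset):

```python
import ray

# Illustrative dataset path; substitute the benchmark's real input.
ds = ray.data.read_images("s3:https://anonymous@some-bucket/images/")
# size_bytes() and count() are standard Dataset methods.
bytes_per_row = ds.size_bytes() / ds.count()
print(f"approx. bytes per row: {bytes_per_row:.0f}")
```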
@scottjlee could you also update the benchmark results when you have them? We can merge this benchmark first and optimize perf later.
I get a strange import error after refactoring the common utility functions into a file outside of the benchmark file itself:
This doesn't happen if the methods are located directly in the benchmark file.
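One possible cause (an assumption, not confirmed in this thread) is that the remote worker processes deserializing the map_batches UDFs don't have the new util module on their import path; shipping the benchmark's directory to the cluster is one way to rule that out:

```python
import ray

# Assumption: make the local util.py importable on every worker by
# shipping the benchmark's directory as the job's working_dir.
ray.init(runtime_env={"working_dir": "."})
```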
Successful release test run with both tests passing:
As a followup to #37624, add the following additional parameters for the multi-node training benchmark:
- File type (image, parquet)
- local shuffle buffer size
- preserve_order (train config)
- increases default # epochs to 10

Signed-off-by: Scott Lee <[email protected]>
Why are these changes needed?
Add a benchmark / release test for a multi-node scenario involving the following steps:
- load images from S3 with `ray.data.read_images`
- apply preprocessing with `map_batches`
- training with `TorchTrainer`
In a sample run of the benchmark, we can see the throughput for the steps above:
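```
{'ray.data+transform': 2.9862405760130937, 'ray.TorchTrainer.fit': 1.3551558704631792}
```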
Related issue number
Closes #37355
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.