
[Data] Add benchmark for Ray Data + Trainer #37624

Merged
merged 21 commits on Jul 25, 2023

Conversation

scottjlee
Contributor

@scottjlee scottjlee commented Jul 20, 2023

Why are these changes needed?

Add a benchmark / release test for a multi-node scenario involving the following steps (a code sketch follows the sample output below):

  • load images from S3 with ray.data.read_images
  • apply preprocessing with map_batches
  • train with TorchTrainer

In a sample run of the benchmark, we can see the throughput for the steps above:

{'ray.data+transform': 2.9862405760130937, 'ray.TorchTrainer.fit': 1.3551558704631792}
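
For reference, a minimal sketch of the three-step pipeline above (not the benchmark's actual code: the S3 path, image-shape assumption, preprocessing, and worker/batch-size settings are illustrative placeholders):

```
import time

import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def crop_and_flip_image_batch(batch):
    # Placeholder preprocessing; assumes uniformly sized (N, H, W, C) uint8 images.
    images = batch["image"][:, :224, :224, :]        # naive crop
    batch["image"] = images[:, :, ::-1, :].copy()    # horizontal flip
    return batch


def train_loop_per_worker(config):
    # Step 3: iterate over the per-worker dataset shard (no real model is applied,
    # so this effectively measures ingestion throughput).
    it = session.get_dataset_shard("train")
    for epoch in range(config["num_epochs"]):
        start_t, num_rows = time.time(), 0
        for batch in it.iter_torch_batches(batch_size=config["batch_size"]):
            num_rows += batch["image"].shape[0]
        session.report({"tput": num_rows / (time.time() - start_t), "epoch": epoch})


# Steps 1 and 2: read images from S3 and apply preprocessing.
ds = ray.data.read_images("s3://my-bucket/imagenet-sample/")  # placeholder path
ds = ds.map_batches(crop_and_flip_image_batch, batch_format="numpy")

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 1, "batch_size": 32},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    datasets={"train": ds},
)
result = trainer.fit()
```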

Related issue number

Closes #37355

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scottjlee
Contributor Author

scottjlee commented Jul 21, 2023

Successful release test run:

@scottjlee scottjlee marked this pull request as ready for review July 21, 2023 03:48
)
parser.add_argument(
"--num-workers",
default=2,
Contributor

Should we test on a larger scale as well?

Contributor

I think the default value can be small, to ease local testing. But we should use a larger arg in release_tests.yaml.

# TorchTrainer.fit() (step 3 above)


def iterate(dataset, label, metrics):
Contributor

metrics[label] = tput


def get_transform(to_torch_tensor):
Contributor

ditto.

return transform


def crop_and_flip_image_batch(image_batch):
Contributor

ditto


.map_batches(crop_and_flip_image_batch)
)
# Iterate over the dataset.
for i in range(args.num_epochs):
Contributor

This is probably not useful. In this benchmark, I think we want to make the usage as close to real training workloads as possible.

Contributor Author

I think we included this to compare throughput between the data ingestion and the training phases. If we don't need benchmarking for this part, I can just remove this iterate() call.

Contributor

I think what we currently measure in the training loop function is already data ingestion throughput (because we don't apply a real model).

end_t = time.time()
# Record throughput per epoch.
epoch_tput = num_rows / (end_t - start_t)
session.report({"tput": epoch_tput, "epoch": i})
Contributor

Also report the throughput in terms of data size?
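
For illustration, a hedged sketch of reporting byte-based throughput alongside rows/sec (the iterate_epoch helper and the metric names are hypothetical; it assumes batches are dicts of torch tensors with an "image" column and is meant to be called from inside the training loop):

```
import time

from ray.air import session


def iterate_epoch(torch_batch_iter, i):
    start_t, num_rows, num_bytes = time.time(), 0, 0
    for batch in torch_batch_iter:
        images = batch["image"]
        num_rows += images.shape[0]
        num_bytes += images.element_size() * images.nelement()
    end_t = time.time()
    # Report both row-based and byte-based throughput for epoch i.
    session.report({
        "tput": num_rows / (end_t - start_t),
        "tput_bytes_per_s": num_bytes / (end_t - start_t),
        "epoch": i,
    })
```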

@scottjlee
Contributor Author

Also, what do we think about the current default batch size being 32? @c21 @raulchen @stephanie-wang

)
parser.add_argument(
"--batch-size",
default=32,
Contributor

What does GPU utilization look like on the dashboard?

@raulchen
Contributor

> Also, what do we think about the current default batch size being 32? @c21 @raulchen @stephanie-wang

I think the principle is that we want to maximize this number as long as the data can fit into GPU memory. Could you measure the size of each row? Then we can figure out a proper number.
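
As a rough illustration of measuring per-row size with Ray Data (hypothetical snippet; the path is a placeholder and size_bytes() is only an in-memory estimate):

```
import ray

ds = ray.data.read_images("s3://my-bucket/imagenet-sample/")  # placeholder path
# Estimate average in-memory bytes per row; may trigger execution on a lazy dataset.
bytes_per_row = ds.size_bytes() / ds.count()
print(f"approx. {bytes_per_row / 1024 ** 2:.2f} MiB per row")
```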

@raulchen
Contributor

@scottjlee could you also update the benchmark results when you have them? We can merge this benchmark first and optimize perf later.

@scottjlee
Contributor Author

I get a strange import error after refactoring the common utility functions into a file outside of the benchmark file itself:

(TrainTrainable pid=350, ip=10.0.27.110) Traceback (most recent call last):
(TrainTrainable pid=350, ip=10.0.27.110)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/serialization.py", line 387, in deserialize_objects
(TrainTrainable pid=350, ip=10.0.27.110)     obj = self._deserialize_object(data, metadata, object_ref)
(TrainTrainable pid=350, ip=10.0.27.110)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/serialization.py", line 268, in _deserialize_object
(TrainTrainable pid=350, ip=10.0.27.110)     return self._deserialize_msgpack_data(data, metadata_fields)
(TrainTrainable pid=350, ip=10.0.27.110)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/serialization.py", line 223, in _deserialize_msgpack_data
(TrainTrainable pid=350, ip=10.0.27.110)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(TrainTrainable pid=350, ip=10.0.27.110)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/serialization.py", line 211, in _deserialize_pickle5_data
(TrainTrainable pid=350, ip=10.0.27.110)     obj = pickle.loads(in_band, buffers=buffers)
(TrainTrainable pid=350, ip=10.0.27.110) ModuleNotFoundError: No module named 'benchmark_utils'

This doesn't happen if the functions are defined directly in the benchmark file.
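
For context, a common workaround for this class of error (not applied in this PR) is to ship the helper module to remote workers via a runtime environment so pickled functions can be deserialized there; a minimal sketch, assuming benchmark_utils.py lives in the current working directory:

```
import ray

# Include the local working directory (containing benchmark_utils.py) in the
# runtime environment so worker processes can import it during deserialization.
ray.init(runtime_env={"working_dir": "."})
```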

@scottjlee scottjlee added the do-not-merge Do not merge this PR! label Jul 22, 2023
@scottjlee scottjlee added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed do-not-merge Do not merge this PR! labels Jul 25, 2023
@scottjlee
Contributor Author

Successful release test run with both tests passing:
https://buildkite.com/ray-project/release-tests-pr/builds/46436

@c21 c21 merged commit 1dbd7c1 into ray-project:master Jul 25, 2023
1 of 2 checks passed
c21 pushed a commit that referenced this pull request Jul 31, 2023
As a followup to #37624, add the following additional parameters for the multi-node training benchmark:
- File type (image, parquet)
- local shuffle buffer size
- preserve_order (train config)
- increases default # epochs to 10

Signed-off-by: Scott Lee <[email protected]>
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[Data] Add large-scale training data ingestion benchmarks to release tests
4 participants