
[Data] Add benchmark for Ray Data + Trainer #37624

Merged
merged 21 commits on Jul 25, 2023

Conversation

scottjlee
Contributor

@scottjlee scottjlee commented Jul 20, 2023

Why are these changes needed?

Add a benchmark / release test for a multi-node scenario involving the following steps (a code sketch follows the sample output below):

  • load images from S3 with ray.data.read_images
  • apply preprocessing with map_batches
  • train with TorchTrainer

In a sample run of the benchmark, we can see the throughput for the steps above:

{'ray.data+transform': 2.9862405760130937, 'ray.TorchTrainer.fit': 1.3551558704631792}
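
For reference, a minimal sketch of the three-step pipeline above (not the benchmark's actual code: the S3 path, image-shape assumption, preprocessing, and worker/batch-size settings are illustrative placeholders):

```
import time

import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def crop_and_flip_image_batch(batch):
    # Placeholder preprocessing; assumes uniformly sized (N, H, W, C) uint8 images.
    images = batch["image"][:, :224, :224, :]        # naive crop
    batch["image"] = images[:, :, ::-1, :].copy()    # horizontal flip
    return batch


def train_loop_per_worker(config):
    # Step 3: iterate over the per-worker dataset shard (no real model is applied,
    # so this effectively measures ingestion throughput).
    it = session.get_dataset_shard("train")
    for epoch in range(config["num_epochs"]):
        start_t, num_rows = time.time(), 0
        for batch in it.iter_torch_batches(batch_size=config["batch_size"]):
            num_rows += batch["image"].shape[0]
        session.report({"tput": num_rows / (time.time() - start_t), "epoch": epoch})


# Steps 1 and 2: read images from S3 and apply preprocessing.
ds = ray.data.read_images("s3://my-bucket/imagenet-sample/")  # placeholder path
ds = ds.map_batches(crop_and_flip_image_batch, batch_format="numpy")

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 1, "batch_size": 32},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    datasets={"train": ds},
)
result = trainer.fit()
```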

Related issue number

Closes #37355

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scottjlee
Contributor Author

scottjlee commented Jul 21, 2023

Successful release test run:

@scottjlee scottjlee marked this pull request as ready for review July 21, 2023 03:48
)
parser.add_argument(
"--num-workers",
default=2,
Contributor

Should we test on a larger scale as well?

Contributor

I think the default value can be small, to ease local testing. But we should use a larger arg in release_tests.yaml.

# TorchTrainer.fit() (step 3 above)


def iterate(dataset, label, metrics):
Contributor

metrics[label] = tput


def get_transform(to_torch_tensor):
Contributor

ditto.

return transform


def crop_and_flip_image_batch(image_batch):
Contributor

ditto


.map_batches(crop_and_flip_image_batch)
)
# Iterate over the dataset.
for i in range(args.num_epochs):
Contributor

This is probably not useful. In this benchmark, I think we want to make the usage as close to real training workloads as possible.

Contributor Author

I think we included this to compare throughput between the data ingestion and the training phases. If we don't need benchmarking for this part, I can just remove this iterate() call.

Contributor

I think what we currently measure in the training loop function is already data ingestion throughput (because we don't apply a real model).

end_t = time.time()
# Record throughput per epoch.
epoch_tput = num_rows / (end_t - start_t)
session.report({"tput": epoch_tput, "epoch": i})
Contributor

Also report the throughput in terms of data size?
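
For illustration, a hedged sketch of reporting byte-based throughput alongside rows/sec (the iterate_epoch helper and the metric names are hypothetical; it assumes batches are dicts of torch tensors with an "image" column and is meant to be called from inside the training loop):

```
import time

from ray.air import session


def iterate_epoch(torch_batch_iter, i):
    start_t, num_rows, num_bytes = time.time(), 0, 0
    for batch in torch_batch_iter:
        images = batch["image"]
        num_rows += images.shape[0]
        num_bytes += images.element_size() * images.nelement()
    end_t = time.time()
    # Report both row-based and byte-based throughput for epoch i.
    session.report({
        "tput": num_rows / (end_t - start_t),
        "tput_bytes_per_s": num_bytes / (end_t - start_t),
        "epoch": i,
    })
```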

@scottjlee
Contributor Author

Also, what do we think about the current default batch size being 32? @c21 @raulchen @stephanie-wang

)
parser.add_argument(
"--batch-size",
default=32,
Contributor

What does GPU utilization look like on the dashboard?

@raulchen
Contributor

> Also, what do we think about the current default batch size being 32? @c21 @raulchen @stephanie-wang

I think the principle is that we want to maximize this number as long as the data can fit into GPU memory. Could you measure the size of each row? Then we can figure out a proper number.
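
As a rough illustration of measuring per-row size with Ray Data (hypothetical snippet; the path is a placeholder and size_bytes() is only an in-memory estimate):

```
import ray

ds = ray.data.read_images("s3://my-bucket/imagenet-sample/")  # placeholder path
# Estimate average in-memory bytes per row; may trigger execution on a lazy dataset.
bytes_per_row = ds.size_bytes() / ds.count()
print(f"approx. {bytes_per_row / 1024 ** 2:.2f} MiB per row")
```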

@raulchen
Contributor

@scottjlee could you also update the benchmark results when you have them? We can merge this benchmark first and optimize perf later.

@scottjlee
Contributor Author

I get a strange import error after refactoring the common utility functions into a file outside of the benchmark file itself:

(TrainTrainable pid=350, ip=10.0.27.110) Traceback (most recent call last):
(TrainTrainable pid=350, ip=10.0.27.110)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/serialization.py", line 387, in deserialize_objects
(TrainTrainable pid=350, ip=10.0.27.110)     obj = self._deserialize_object(data, metadata, object_ref)
(TrainTrainable pid=350, ip=10.0.27.110)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/serialization.py", line 268, in _deserialize_object
(TrainTrainable pid=350, ip=10.0.27.110)     return self._deserialize_msgpack_data(data, metadata_fields)
(TrainTrainable pid=350, ip=10.0.27.110)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/serialization.py", line 223, in _deserialize_msgpack_data
(TrainTrainable pid=350, ip=10.0.27.110)     python_objects = self._deserialize_pickle5_data(pickle5_data)
(TrainTrainable pid=350, ip=10.0.27.110)   File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/serialization.py", line 211, in _deserialize_pickle5_data
(TrainTrainable pid=350, ip=10.0.27.110)     obj = pickle.loads(in_band, buffers=buffers)
(TrainTrainable pid=350, ip=10.0.27.110) ModuleNotFoundError: No module named 'benchmark_utils'

This doesn't happen if the functions are defined directly in the benchmark file.
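
For context, a common workaround for this class of error (not applied in this PR) is to ship the helper module to remote workers via a runtime environment so pickled functions can be deserialized there; a minimal sketch, assuming benchmark_utils.py lives in the current working directory:

```
import ray

# Include the local working directory (containing benchmark_utils.py) in the
# runtime environment so worker processes can import it during deserialization.
ray.init(runtime_env={"working_dir": "."})
```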

@scottjlee scottjlee added the do-not-merge Do not merge this PR! label Jul 22, 2023
@scottjlee scottjlee added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed do-not-merge Do not merge this PR! labels Jul 25, 2023
@scottjlee
Contributor Author

Successful release test run with both tests passing:
https://buildkite.com/ray-project/release-tests-pr/builds/46436

@c21 c21 merged commit 1dbd7c1 into ray-project:master Jul 25, 2023
1 of 2 checks passed
c21 pushed a commit that referenced this pull request Jul 31, 2023
As a followup to #37624, add the following additional parameters for the multi-node training benchmark:
- File type (image, parquet)
- local shuffle buffer size
- preserve_order (train config)
- increases default # epochs to 10

Signed-off-by: Scott Lee <[email protected]>
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Development

Successfully merging this pull request may close these issues.

[Data] Add large-scale training data ingestion benchmarks to release tests
4 participants