Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Improve size estimation of image folder data source #27219

Merged
merged 6 commits into from
Aug 12, 2022

Conversation

c21
Copy link
Contributor

@c21 c21 commented Jul 28, 2022

Signed-off-by: Cheng Su [email protected]

Why are these changes needed?

This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.

Given size and mode is set to be optional in #27295, so change this PR to tackle the simple case when size and mode are both provided.

  • size and mode is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
  • size or mode is not provided: need sampling to determine the in-memory size (will do in another followup PR).

Here is example of estiamted size for our test image folder

>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example:https://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428

Without this PR:

>>> ds.size_bytes()
18978

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@c21
Copy link
Contributor Author

c21 commented Jul 28, 2022

The added unit test of this PR depends on #27156. But this PR itself is also ready for review.

non_empty_path_and_size = list(
filter(lambda p: p[1] > 0, zip(self._paths, self._file_sizes))
)
assert len(non_empty_path_and_size) > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we raise ValueError or return the default estimate in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl - sure, return default value as 1 (to use on-disk file size) in this case. We anyway will throw exception in other places when input data source is empty.

python/ray/data/datasource/image_folder_datasource.py Outdated Show resolved Hide resolved
python/ray/data/datasource/image_folder_datasource.py Outdated Show resolved Hide resolved
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 28, 2022
@c21 c21 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 28, 2022
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So actually, since size is required, can't we statically compute the image size based on the resize dimensions? WHdtype right? No need to perform a disk IO.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 29, 2022
@c21
Copy link
Contributor Author

c21 commented Jul 29, 2022

So actually, since size is required, can't we statically compute the image size based on the resize dimensions? WHdtype right? No need to perform a disk IO.

@ericl - that's what I was planning to do in first place. But with size, we only know the width and height. We don't know the 3rd dimension - whether it's a RGB or grayscale image, and we have to assume each element in image matrix is a uint8 or something.

self._reader_args.get("root"),
self._reader_args.get("size"),
)
single_image_memory_size = single_image_block.memory_usage().sum()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clarkzinzow - good catch, I think we should, let me add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clarkzinzow - added.

@clarkzinzow
Copy link
Contributor

clarkzinzow commented Jul 29, 2022

@c21 We at least know that the data will always be uint8 bytes:

image = skimage.util.img_as_ubyte(image)

As for the channel dimension, can't you just do 8 * np.prod(size), and that should work generically for any image shapes, e.g. (H, W, C)?

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I guess this seems like a reasonable approach if we have to decode the image anyway.

@ericl
Copy link
Contributor

ericl commented Jul 29, 2022

/tmp/artifacts/.failed_test_logs/python::ray::data::tests::test_dataset_formats.py::test_image_folder_reader_estimate_data_size[64-30000-5]_1659048762.8525.zip
 

@clarkzinzow
Copy link
Contributor

Ah it looks like it's valid for the user to only pass (H, W) while skimage will preserve the channels... That's unfortunate. https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.resize

@c21
Copy link
Contributor Author

c21 commented Jul 29, 2022

As for the channel dimension, can't you just do 8 * np.prod(size), and that should work generically for any image shapes, e.g. (H, W, C)?

@clarkzinzow - I may miss something, but how do you know the 3rd dimension of image - whether it's a RGB (h, w, 3)or grayscale (h, w, 1)?

@c21
Copy link
Contributor Author

c21 commented Jul 29, 2022

@ericl - yeah the unit test is failing now, because this PR depends on #27156 (#27219 (comment)). I tested locally it passed when I stacked this PR on top of the other one. But github does not allow publishing stacked PRs.

@c21 c21 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 29, 2022
@c21 c21 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 29, 2022
@c21
Copy link
Contributor Author

c21 commented Jul 29, 2022

Hold off on this, as more discussion around image resizing is still on-going. cc @bveeramani, @clarkzinzow and @amogkam

@c21
Copy link
Contributor Author

c21 commented Aug 10, 2022

Given size and mode is set to be optional in #27295, so change this PR to tackle the simple case when size and mode are both provided.

  • size and mode is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
  • size or mode is not provided: need sampling to determine the in-memory size (will do in another followup PR).

@c21 c21 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 10, 2022
@c21
Copy link
Contributor Author

c21 commented Aug 10, 2022

@ericl, @jianoaix, @clarkzinzow - could you help take a look when you have time? Thanks.


sampling_duration = time.perf_counter() - start_time
if sampling_duration > 5:
logger.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a warning message if this is considered a bit too long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jianoaix - make sense, updated.

@ericl ericl merged commit 7c7828f into ray-project:master Aug 12, 2022
@c21 c21 deleted the image branch August 12, 2022 22:20
gramhagen pushed a commit to gramhagen/ray that referenced this pull request Aug 15, 2022
…roject#27219)

This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.

Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided.
* `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
* `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR).

Here is example of estiamted size for our test image folder

```
>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example:https://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428
```

Without this PR:

```
>>> ds.size_bytes()
18978
```
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…roject#27219)

This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.

Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided.
* `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
* `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR).

Here is example of estiamted size for our test image folder

```
>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example:https://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428
```

Without this PR:

```
>>> ds.size_bytes()
18978
```

Signed-off-by: Stefan van der Kleij <[email protected]>
JiahaoYao pushed a commit to JiahaoYao/ray that referenced this pull request Aug 21, 2022
…roject#27219)

This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.

Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided.
* `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
* `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR).

Here is example of estiamted size for our test image folder

```
>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example:https://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428
```

Without this PR:

```
>>> ds.size_bytes()
18978
```
JiahaoYao pushed a commit to JiahaoYao/ray that referenced this pull request Aug 22, 2022
…roject#27219)

This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.

Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided.
* `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
* `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR).

Here is example of estiamted size for our test image folder

```
>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example:https://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428
```

Without this PR:

```
>>> ds.size_bytes()
18978
```
JiahaoYao pushed a commit to JiahaoYao/ray that referenced this pull request Aug 22, 2022
…roject#27219)

This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.

Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided.
* `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
* `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR).

Here is example of estiamted size for our test image folder

```
>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example:https://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428
```

Without this PR:

```
>>> ds.size_bytes()
18978
```
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Sep 1, 2022
…roject#27219)

This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.

Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided.
* `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)
* `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR).

Here is example of estiamted size for our test image folder

```
>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example:https://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428
```

Without this PR:

```
>>> ds.size_bytes()
18978
```

Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants