-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Improve size estimation of image folder data source #27219
Conversation
The added unit test of this PR depends on #27156. But this PR itself is also ready for review. |
non_empty_path_and_size = list( | ||
filter(lambda p: p[1] > 0, zip(self._paths, self._file_sizes)) | ||
) | ||
assert len(non_empty_path_and_size) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we raise ValueError or return the default estimate in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ericl - sure, return default value as 1 (to use on-disk file size) in this case. We anyway will throw exception in other places when input data source is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So actually, since size is required, can't we statically compute the image size based on the resize dimensions? WHdtype right? No need to perform a disk IO.
@ericl - that's what I was planning to do in first place. But with |
self._reader_args.get("root"), | ||
self._reader_args.get("size"), | ||
) | ||
single_image_memory_size = single_image_block.memory_usage().sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need deep=True
here? https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clarkzinzow - good catch, I think we should, let me add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@clarkzinzow - added.
@c21 We at least know that the data will always be uint8 bytes:
As for the channel dimension, can't you just do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I guess this seems like a reasonable approach if we have to decode the image anyway.
/tmp/artifacts/.failed_test_logs/python::ray::data::tests::test_dataset_formats.py::test_image_folder_reader_estimate_data_size[64-30000-5]_1659048762.8525.zip |
Ah it looks like it's valid for the user to only pass |
@clarkzinzow - I may miss something, but how do you know the 3rd dimension of image - whether it's a RGB |
@ericl - yeah the unit test is failing now, because this PR depends on #27156 (#27219 (comment)). I tested locally it passed when I stacked this PR on top of the other one. But github does not allow publishing stacked PRs. |
Hold off on this, as more discussion around image resizing is still on-going. cc @bveeramani, @clarkzinzow and @amogkam |
Given
|
@ericl, @jianoaix, @clarkzinzow - could you help take a look when you have time? Thanks. |
|
||
sampling_duration = time.perf_counter() - start_time | ||
if sampling_duration > 5: | ||
logger.info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a warning message if this is considered a bit too long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jianoaix - make sense, updated.
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
…roject#27219) This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing. Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided. * `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR) * `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR). Here is example of estiamted size for our test image folder ``` >>> import ray >>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource >>> root = "example:https://image-folders/different-sizes" >>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB") >>> ds.size_bytes() 40310 >>> ds.fully_executed().size_bytes() 37428 ``` Without this PR: ``` >>> ds.size_bytes() 18978 ```
…roject#27219) This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing. Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided. * `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR) * `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR). Here is example of estiamted size for our test image folder ``` >>> import ray >>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource >>> root = "example:https://image-folders/different-sizes" >>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB") >>> ds.size_bytes() 40310 >>> ds.fully_executed().size_bytes() 37428 ``` Without this PR: ``` >>> ds.size_bytes() 18978 ``` Signed-off-by: Stefan van der Kleij <[email protected]>
…roject#27219) This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing. Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided. * `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR) * `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR). Here is example of estiamted size for our test image folder ``` >>> import ray >>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource >>> root = "example:https://image-folders/different-sizes" >>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB") >>> ds.size_bytes() 40310 >>> ds.fully_executed().size_bytes() 37428 ``` Without this PR: ``` >>> ds.size_bytes() 18978 ```
…roject#27219) This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing. Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided. * `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR) * `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR). Here is example of estiamted size for our test image folder ``` >>> import ray >>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource >>> root = "example:https://image-folders/different-sizes" >>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB") >>> ds.size_bytes() 40310 >>> ds.fully_executed().size_bytes() 37428 ``` Without this PR: ``` >>> ds.size_bytes() 18978 ```
…roject#27219) This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing. Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided. * `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR) * `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR). Here is example of estiamted size for our test image folder ``` >>> import ray >>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource >>> root = "example:https://image-folders/different-sizes" >>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB") >>> ds.size_bytes() 40310 >>> ds.fully_executed().size_bytes() 37428 ``` Without this PR: ``` >>> ds.size_bytes() 18978 ```
…roject#27219) This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing. Given `size` and `mode` is set to be optional in ray-project#27295, so change this PR to tackle the simple case when `size` and `mode` are both provided. * `size` and `mode` is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR) * `size` or `mode` is not provided: need sampling to determine the in-memory size (will do in another followup PR). Here is example of estiamted size for our test image folder ``` >>> import ray >>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource >>> root = "example:https://image-folders/different-sizes" >>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB") >>> ds.size_bytes() 40310 >>> ds.fully_executed().size_bytes() 37428 ``` Without this PR: ``` >>> ds.size_bytes() 18978 ``` Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Signed-off-by: Cheng Su [email protected]
Why are these changes needed?
This PR is to improve in-memory data size estimation of image folder data source. Before this PR, we use on-disk file size as estimation of in-memory data size of image folder data source. This can be inaccurate due to image compression and in-memory image resizing.
Given
size
andmode
is set to be optional in #27295, so change this PR to tackle the simple case whensize
andmode
are both provided.size
andmode
is provided: just calculate the in-memory size based on the dimensions, not need to read any image (this PR)size
ormode
is not provided: need sampling to determine the in-memory size (will do in another followup PR).Here is example of estiamted size for our test image folder
Without this PR:
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.