Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Add size parameter to ImageFolderDatasource #26975

Merged
merged 9 commits into from
Jul 26, 2022

Conversation

bveeramani
Copy link
Member

@bveeramani bveeramani commented Jul 25, 2022

Why are these changes needed?

If you read a folder with differently-sized images, ImageFolderDatasource errors. This PR fixes the issue by resizing images to a user-specified size.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

python/ray/data/datasource/image_folder_datasource.py Outdated Show resolved Hide resolved
partition_filter: PathPartitionFilter = None,
**kwargs,
root: str,
size: Tuple[int, int],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to do any sanity check of the user-provided size? What's the largest image size we want to support now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check that the size contains positive numbers.

No idea what the large image size should be, if any.

Copy link
Contributor

@c21 c21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jiaodong jiaodong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the quick fix ! One confirmation: i remember you previously mentioned our implementation assumes folder structure is same as imagenet, is this behavior still the same ? Looks like we're making it single root path now.

Also, how would user add / apply labels to each image read with ImageFolderDatasource API ?

@bveeramani
Copy link
Member Author

thanks for the quick fix ! One confirmation: i remember you previously mentioned our implementation assumes folder structure is same as imagenet, is this behavior still the same ? Looks like we're making it single root path now.

Behavior is identical. Previously, create_reader would raise an error if paths contained more than one element.

        if len(paths) != 1:
            raise ValueError(
                "`ImageFolderDatasource` expects 1 path representing the dataset "
                f"root, but it got {len(paths)} paths instead. To fix this "
                "error, pass in a single-element list containing the dataset root "
                '(for example, `paths=["s3:https://imagenet/train"]`)'
            )

To improve UX, I've renamed paths to root and changed the type to str.

Also, how would user add / apply labels to each image read with ImageFolderDatasource API ?

By label, do you mean integer targets used for training? This datasource already creates a "label" column.

@c21
Copy link
Contributor

c21 commented Jul 26, 2022

Looks like there's some test failure in AIR. Could you help fix these? Thanks.

==================== Test output for //doc/source/ray-air/examples:torch_image_batch_pretrained:
--
  | WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 2684354560 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=9.34gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
  | Started a local Ray instance. View the dashboard at https://127.0.0.1:8265.
  | Running GPU batch prediction with 1GB data from s3:https://anonymous@air-example-data-2/1G-image-data-synthetic-raw
  | Traceback (most recent call last):
  | File "/root/.cache/bazel/_bazel_root/5fe90af4e7d1ed9fcf52f59e39e126f5/execroot/com_github_ray_project_ray/bazel-out/k8-opt/bin/doc/source/ray-air/examples/torch_image_batch_pretrained.runfiles/com_github_ray_project_ray/doc/source/ray-air/examples/torch_image_batch_pretrained.py", line 33, in <module>
  | dataset = ray.data.read_datasource(ImageFolderDatasource(), paths=[data_url])
  | File "/ray/python/ray/data/read_api.py", line 274, in read_datasource
  | _wrap_and_register_arrow_serialization_workaround(read_args),
  | File "/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
  | return func(*args, **kwargs)
  | File "/ray/python/ray/_private/worker.py", line 2239, in get
  | raise value.as_instanceof_cause()
  | ray.exceptions.RayTaskError(TypeError): ray::_get_read_tasks() (pid=3661, ip=172.16.16.3)
  | File "/ray/python/ray/data/read_api.py", line 1136, in _get_read_tasks
  | reader = ds.create_reader(**kwargs)
  | TypeError: create_reader() got an unexpected keyword argument 'paths'
  | ================================================================================
  | (13:28:10) [79 / 100] 14 / 21 tests, 1 failed; Testing //doc/source/ray-air/examples:torch_image_example; 199s local
  | (13:31:32) [80 / 101] 15 / 21 tests, 1 failed; Testing //doc/source/ray-air/examples:torch_incremental_learning; 43s local
  | (13:35:24) [80 / 101] 15 / 21 tests, 1 failed; Testing //doc/source/ray-air/examples:torch_incremental_learning; 275s local
  | (13:39:51) [85 / 106] 20 / 21 tests, 1 failed; Testing //doc/source/ray-air/examples:xgboost_starter; 10s local

@bveeramani
Copy link
Member Author

Looks like there's some test failure in AIR. Could you help fix these? Thanks.

Okay, should be fixed.

@richardliaw
Copy link
Contributor

fix lint?

@bveeramani bveeramani added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 26, 2022
@bveeramani bveeramani removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 26, 2022
Copy link
Member

@jiaodong jiaodong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating our examples ! nit: You should be able to remove transforms.Resize(256) now given the resize is already done. Not a blocker.

@bveeramani bveeramani added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jul 26, 2022
Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stamp on docs

@ericl ericl merged commit 89f7f2a into ray-project:master Jul 26, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
…ct#26975)

If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.

Signed-off-by: Rohan138 <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…ct#26975)

If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.

Signed-off-by: Stefan van der Kleij <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants