[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder #1084

felipemello1 · 2024-06-13T03:43:13Z

Context

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Added the image transforms for CLIP.

Builder with the model-specific values in:
torchtune/models/clip/_model_builders.py

CLIP-specific use of the transforms in:
torchtune/models/clip/_transforms.py

Vision transforms in:
torchtune/modules/transforms/vision

Every transform is a function.

Algorithm TLDR:

1. Gets the list of possible resolutions based on tile_size and max_num_tiles
2. Finds the resolution that best fits the image
3. Resizes without distortion
4. Pads
5. Normalizes
6. tile_crop -> output is of shape [num_tiles, 3, tile_size, tile_size]
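
The first two steps and the final output shape can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: the `_sketch` function names, the selection rule in `best_fit_canvas_sketch` (smallest upscaling unless `resize_to_max_canvas` is set, otherwise least downscaling, ties broken by smallest area), and `tiled_output_shape` are all hypothetical.

```python
from typing import List, Tuple

Resolution = Tuple[int, int]  # (height, width)


def find_supported_resolutions_sketch(max_num_tiles: int, tile_size: int) -> List[Resolution]:
    """Step 1: every canvas that is a rows x cols grid of whole tiles."""
    return [
        (rows * tile_size, cols * tile_size)
        for rows in range(1, max_num_tiles + 1)
        for cols in range(1, max_num_tiles // rows + 1)
    ]


def best_fit_canvas_sketch(
    image_size: Resolution,
    resolutions: List[Resolution],
    resize_to_max_canvas: bool = False,
) -> Resolution:
    """Step 2 (assumed rule): choose a canvas by the scale factor it allows."""
    img_h, img_w = image_size
    # Largest aspect-preserving scale that still fits inside each canvas.
    scales = [min(h / img_h, w / img_w) for h, w in resolutions]
    upscaling = [s for s in scales if s >= 1]
    if upscaling:
        chosen = max(upscaling) if resize_to_max_canvas else min(upscaling)
    else:
        # Every canvas is smaller than the image: downscale as little as possible.
        chosen = max(scales)
    candidates = [res for res, s in zip(resolutions, scales) if s == chosen]
    # Among equally good canvases, prefer the one with the least padding.
    return min(candidates, key=lambda res: res[0] * res[1])


def tiled_output_shape(canvas: Resolution, tile_size: int, channels: int = 3):
    """Step 6: shape of the tile_crop output for a given canvas."""
    rows, cols = canvas[0] // tile_size, canvas[1] // tile_size
    return (rows * cols, channels, tile_size, tile_size)


resolutions = find_supported_resolutions_sketch(max_num_tiles=4, tile_size=224)
canvas = best_fit_canvas_sketch((200, 300), resolutions)
print(canvas)                           # (224, 448)
print(tiled_output_shape(canvas, 224))  # (2, 3, 224, 224)
```

With `max_num_tiles=4` this enumerates eight canvases, so a 200x300 image lands on a one-by-two grid of 224-pixel tiles under the assumed rule.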

Changelog

Added torchvision to the requirements

Test plan

Every function is covered, some indirectly. All tests pass.

run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
add unit tests for any new functionality
update docstrings for any new or updated methods or classes
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
manually run any new or modified recipes with sufficient proof of correctness
- include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

pytorch-bot · 2024-06-13T03:43:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1084

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pbontrager

This is a great first pass at adding all the functionality needed for high-resolution CLIP image transforms. I've left a lot of comments on the structure of the transforms, since you're adding the first ones to the library. After we update these, I'll take a pass over the tests and docs.

torchtune/models/clip/_model_builders.py

torchtune/modules/transforms/pipelines.py

torchtune/modules/transforms/transforms.py

torchtune/modules/transforms/utils.py

torchtune/modules/transforms/pipelines.py

joecummings · 2024-06-25T01:21:14Z

torchtune/modules/transforms/vision/get_canvas_best_fit.py

+ If False, pick the canvas that minimizes downscaling, including no downscaling at all.
+ Returns:
+ Tuple[int, int]: The best resolution to fit the image into.
+ Examples:


Nit: These examples are not rendering correctly in docs.

I think you need to add newlines between Args, Returns, and Examples,
or you need to add a second colon.

Suggested change

- Examples:
+ Examples::
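
As a rough illustration of the first option, here is a docstring with a blank line between each section. `pick_larger` is a hypothetical function, not one from this PR, and whether `Examples:` or `Examples::` renders better depends on the Sphinx configuration, which is not shown in this thread.

```python
def pick_larger(a: int, b: int) -> int:
    """Return the larger of two integers.

    Args:
        a (int): First value.
        b (int): Second value.

    Returns:
        int: The larger of ``a`` and ``b``.

    Examples:
        >>> pick_larger(2, 5)
        5
    """
    # Each section above is separated by a blank line so the doc builder
    # can tell where one section ends and the next begins.
    return a if a >= b else b
```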

joecummings · 2024-06-25T01:21:36Z

torchtune/modules/transforms/vision/resize_with_pad.py

+ If None, will upscale up to target_size.
+ Returns:
+ torch.Tensor: The resized and padded image tensor in the format [..., H, W].
+ Examples:


nit: These examples are not rendering correctly in docs.

RdoubleA

Looks awesome overall, thanks for all the work you put into this. One overall question I had is the distinction between the CLIP transform which lives in torchtune/models and the other transforms in modules/transforms/vision. Will only the model transforms be classes with __call__ and all the transforms that live in modules/transforms/ be normal functions?

Also, what is the expected approach for a user to make a new model transform? I had originally thought there would be common image transforms that they can just Compose together in the model builder, but I see that CLIPImageTransform is a bit more involved than just piping transforms. So is this the overall user flow we want to enforce:

all general transforms are standalone functions in modules
to make a new model transform, the user defines a class that uses the general transform functions in any way they want
then they create a builder function that parametrizes the class and can be instantiated in the config
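
The three-part flow described above could look roughly like the sketch below. Everything in it is hypothetical and simplified: images are plain nested lists rather than tensors, and `pad_bottom_right`, `normalize`, `ToyImageTransform`, and `toy_model_transform` are illustrative names, not torchtune APIs.

```python
from typing import List

Image = List[List[float]]  # toy stand-in for an image tensor: rows of pixel values


# 1. General transforms are standalone functions.
def pad_bottom_right(image: Image, target_h: int, target_w: int) -> Image:
    """Zero-pad on the right and bottom up to the target size."""
    padded = [row + [0.0] * (target_w - len(row)) for row in image]
    padded += [[0.0] * target_w for _ in range(target_h - len(image))]
    return padded


def normalize(image: Image, mean: float, std: float) -> Image:
    """Shift and scale every pixel."""
    return [[(px - mean) / std for px in row] for row in image]


# 2. A model transform is a class whose __call__ combines the functions.
class ToyImageTransform:
    def __init__(self, target_h: int, target_w: int, mean: float, std: float):
        self.target_h = target_h
        self.target_w = target_w
        self.mean = mean
        self.std = std

    def __call__(self, image: Image) -> Image:
        # Same order as the PR summary: pad first, then normalize.
        image = pad_bottom_right(image, self.target_h, self.target_w)
        return normalize(image, self.mean, self.std)


# 3. A builder pins the model-specific values so a config can refer to it.
def toy_model_transform() -> ToyImageTransform:
    return ToyImageTransform(target_h=2, target_w=3, mean=0.5, std=0.5)


transform = toy_model_transform()
print(transform([[1.0, 0.5]]))  # [[1.0, 0.0, -1.0], [-1.0, -1.0, -1.0]]
```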

RdoubleA · 2024-06-25T16:45:22Z

torchtune/models/clip/_model_builders.py

+from torchtune.models.clip._transforms import CLIPImageTransform
+
+def clip_vit_336_transform():
+


is there a paper ref for these numbers that we can add to the docstring?

torchtune/models/clip/_transforms.py

RdoubleA · 2024-06-25T17:13:14Z

torchtune/modules/transforms/vision/get_canvas_best_fit.py

+ If False, pick the canvas that minimizes downscaling, including no downscaling at all.
+ Returns:
+ Tuple[int, int]: The best resolution to fit the image into.
+ Examples:


I think you need to add newlines between Args, Returns, and Examples,
or you need to add a second colon.

Suggested change

- Examples:
+ Examples::

torchtune/modules/transforms/vision/get_canvas_best_fit.py

RdoubleA · 2024-06-25T17:19:16Z

torchtune/modules/transforms/vision/get_canvas_best_fit.py

+ >>> get_canvas_best_fit(image, possible_resolutions, resize_to_max_canvas=False)
+ (224, 448)
+
+ In the example above, we calculate the scaling factors for each possible resolution


or maybe these lines in between the code need to be tabbed back

torchtune/modules/transforms/vision/get_canvas_best_fit.py

RdoubleA

No major concerns on my end - thanks for pushing this through. You might have to wait until the recipe tests are fixed on main before landing this.

Also curious if we plan to add model transforms to the api_ref_models.rst, cc @pbontrager. But this can be a follow-up.

pbontrager

Everything looks good here. I'll approve this, but I think you should update the check for possible_resolutions to explicitly check for None.

pbontrager · 2024-07-04T18:02:49Z

torchtune/models/clip/_transforms.py

+ ), f"Either possible_resolutions or max_num_tiles must be given. Got {possible_resolutions=} and {max_num_tiles=}"
+
+ # If possible_resolutions are not given, then calculate possible ones based on max_num_tiles
+ if not possible_resolutions and max_num_tiles:


You have to explicitly check for None, otherwise 0 or empty tuples resolve as false

If possible_resolutions is None or is empty or anything that is not a list with items, we must activate this condition to find possible_resolutions
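
The two checks being discussed differ only on empty containers (and other falsy values such as 0). A small sketch with hypothetical helper names makes the difference visible; which behavior is wanted is exactly the question in this thread.

```python
def needs_default_truthy(possible_resolutions) -> bool:
    """Mirrors `if not possible_resolutions`: None and empty containers both trigger."""
    return not possible_resolutions


def needs_default_explicit(possible_resolutions) -> bool:
    """Mirrors `if possible_resolutions is None`: only a missing argument triggers."""
    return possible_resolutions is None


for value in (None, [], (), [(224, 224)]):
    print(repr(value), needs_default_truthy(value), needs_default_explicit(value))
# None True True
# [] True False
# () True False
# [(224, 224)] False False
```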

pbontrager · 2024-07-04T18:03:14Z

torchtune/models/clip/_transforms.py

+ possible_resolutions = find_supported_resolutions(
+ max_num_tiles=max_num_tiles, tile_size=tile_size
+ )
+ else:


This else doesn't make sense to me

lucylq · 2024-07-08T06:09:24Z

torchtune/modules/transforms/vision_utils/resize_with_pad.py

+) -> torch.Tensor:
+ """
+ Places the image at the top left of the canvas and pads with 0 the right and bottom
+ to fit to the target resolution. If target_size < image_size, it will crop the image.


if target_size < image_size, it will crop the image.

From _get_max_res_without_distortion, it seems like we always resize such that the image remains within bounds of target_size - is cropping still required?

lucylq · 2024-07-08T06:10:17Z

torchtune/modules/transforms/vision_utils/resize_with_pad.py

+ new_height = min(math.floor(original_height * scale_w), target_height)
+ else:
+ new_height = target_height
+ new_width = min(math.floor(original_width * scale_h), target_width)


Could this be simplified to:

new_width = min(math.floor(original_width * scale_h), target_width)
new_height = min(math.floor(original_height * scale_w), target_height)
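
One way to explore this question is to put both forms side by side. The sketch below assumes `scale_w = target_width / original_width` and `scale_h = target_height / original_height`, which the quoted hunk does not show; the function names and the `(height, width)` tuple convention are also hypothetical.

```python
import math
from typing import Tuple

Size = Tuple[int, int]  # (height, width)


def max_res_branching(original: Size, target: Size) -> Size:
    """Branching form, following the lines quoted from the diff."""
    original_height, original_width = original
    target_height, target_width = target
    scale_w = target_width / original_width    # assumed definition
    scale_h = target_height / original_height  # assumed definition
    if scale_w < scale_h:
        # Width is the limiting side: fill it and scale the height to match.
        new_width = target_width
        new_height = min(math.floor(original_height * scale_w), target_height)
    else:
        # Height is the limiting side: fill it and scale the width to match.
        new_height = target_height
        new_width = min(math.floor(original_width * scale_h), target_width)
    return new_height, new_width


def max_res_simplified(original: Size, target: Size) -> Size:
    """Branch-free form proposed in the review comment."""
    original_height, original_width = original
    target_height, target_width = target
    scale_w = target_width / original_width
    scale_h = target_height / original_height
    new_width = min(math.floor(original_width * scale_h), target_width)
    new_height = min(math.floor(original_height * scale_w), target_height)
    return new_height, new_width


for original in [(200, 300), (896, 600)]:
    print(max_res_branching(original, (448, 448)), max_res_simplified(original, (448, 448)))
# (298, 448) (298, 448)
# (448, 300) (448, 300)
```

The two forms agree on these inputs. When the two scale factors are equal or nearly equal, floating-point rounding inside `math.floor` could in principle make the results differ by a pixel, so this comparison is not a proof of equivalence.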

first commit

9c2849c

facebook-github-bot added the CLA Signed label Jun 13, 2024

added torchvision to requirements

98df8b9

felipemello1 changed the title ~~[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder~~ [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder - Do not land Jun 13, 2024

felipemello1 changed the title ~~[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder - Do not land~~ Do not land - [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder Jun 13, 2024

Felipe Mello added 4 commits June 12, 2024 21:13

update comments

b3d9d84

lint

acd166f

docstrings better examples rendering

fb542b6

lint

4c40e1a

felipemello1 changed the title ~~Do not land - [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder~~ [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder Jun 13, 2024

changed typing

3641992

felipemello1 requested review from pbontrager and kartikayk June 13, 2024 16:02

Felipe Mello added 2 commits June 13, 2024 13:54

small fix

24c9053

Merge remote-tracking branch 'upstream/main' into image_transforms

31e342a

felipemello1 changed the title ~~[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder~~ wip [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder Jun 14, 2024

felipemello1 marked this pull request as draft June 14, 2024 20:39

pbontrager reviewed Jun 14, 2024

added doc to .rst

4d07db2

kartikayk reviewed Jun 17, 2024

torchtune/modules/transforms/pipelines.py (outdated, resolved)

Felipe Mello added 5 commits June 21, 2024 13:45

first pass of changes after RFC

e0743be

parity check + updated docstrings

ae42334

updated tests

cb0fe3a

updated docs

1799b98

final round of docs review

b1d872e

felipemello1 changed the title ~~wip [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder~~ [CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder Jun 25, 2024

joecummings reviewed Jun 25, 2024

felipemello1 marked this pull request as ready for review June 25, 2024 01:32

renamed file

23b28bf

RdoubleA reviewed Jun 25, 2024

felipemello1 and others added 10 commits July 1, 2024 10:42

Update torchtune/models/clip/_transforms.py

73435ac

Co-authored-by: Rafi Ayub

Update torchtune/modules/transforms/vision/get_canvas_best_fit.py

34e6efe

Co-authored-by: Rafi Ayub

Update torchtune/models/clip/_transforms.py

4a51718

Co-authored-by: Rafi Ayub

Update torchtune/models/clip/_transforms.py

a176497

Co-authored-by: Rafi Ayub

Update torchtune/models/clip/_transforms.py

ac6d5ae

Co-authored-by: Rafi Ayub

Update torchtune/models/clip/_transforms.py

2e5ea0e

Co-authored-by: Rafi Ayub

docstring update

10db82b

removed from init

0a93a5d

Revert "removed from init"

ca2508e

This reverts commit 0a93a5d.

small docstring change

63c117f

RdoubleA approved these changes Jul 1, 2024

update unit test naming

3095f34

pbontrager approved these changes Jul 4, 2024

Merge branch 'main' into image_transforms

dcad8d1

felipemello1 merged commit 06a125e into main Jul 5, 2024
28 checks passed

felipemello1 deleted the image_transforms branch July 5, 2024 19:57

felipemello1 restored the image_transforms branch July 5, 2024 19:58

Aditya-dom added a commit to Aditya-dom/torchtune that referenced this pull request Jul 6, 2024

Merge pull request #1 from pytorch/main

b03f8c3

[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder (pytorch#1084)

lucylq reviewed Jul 8, 2024

maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024

[CLIP][IMAGE TRANSFORMS] Image transforms for clip encoder (pytorch#1084)

d3f53e5

Co-authored-by: Felipe Mello
Co-authored-by: Rafi Ayub
