Question about AnyRes calculations: bug or intended? #59

Open
zucchini-nlp opened this issue Jun 13, 2024 · 1 comment

Comments

@zucchini-nlp
Contributor

Hello LLaVa-NeXT team!

I want to clarify some points about the AnyRes technique and how the image feature is unpadded in the modeling forward pass.

As this issue shows, it seems the get_anyres_image_grid_shape function returns height and width swapped, which in turn results in the shortest edge after unpadding being less than 24.

For more context, below is the code in transformers that we adopted from your repo. The function get_anyres_image_grid_shape returns the height as its first element, but that value is later mapped to num_patch_width. In this scenario, if we have an image with a 2:1 aspect ratio, the final image_feature after unpadding gets size (4096, 24, something smaller than 24), which seems incorrect. The question is: is this a bug or intended behavior?

Btw, I tried inference both ways, i.e. with and without swapping height and width back. I didn't see an actual difference, probably because the chosen images were too easy and the model was attending more to the base-image features.

def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    # NOTE: the height-derived value is returned first here ...
    height, width = select_best_resolution(image_size, grid_pinpoints)
    return height // patch_size, width // patch_size

# later in modeling code (same as in https://github.com/LLaVA-VL/LLaVA-NeXT/blob/6944062b9bb2e61c48436f1a65c3ea339095ec91/llava/model/llava_arch.py#L200-L201)
# NOTE: ... but the first returned value is unpacked as num_patch_width here
num_patch_width, num_patch_height = get_anyres_image_grid_shape(
    image_size,
    image_grid_pinpoints,
    vision_config.image_size,
)
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
image_feature = image_feature.flatten(1, 2).flatten(2, 3)
image_feature = unpad_image(image_feature, image_sizes[image_idx])
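
To make the effect concrete, here is a small self-contained sketch that traces the same view/permute/flatten sequence with the correct and the swapped grid shape. The tile counts, the 24x24 per-tile grid and the toy hidden size are hypothetical numbers chosen only for illustration:

import torch

# Hypothetical setup: a landscape image whose AnyRes grid is 1 tile tall and
# 2 tiles wide, each tile producing a 24x24 feature grid (e.g. a 336px vision
# tower with 14px patches), with a toy hidden size of 8.
height, width = 24, 24      # per-tile feature grid
embed_dim = 8               # toy hidden size
grid_h, grid_w = 1, 2       # true grid: 1 tile along height, 2 tiles along width

image_feature = torch.randn(grid_h * grid_w, height * width, embed_dim)

def spatial_shape(num_patch_height, num_patch_width):
    # same reshaping as in the modeling snippet above
    x = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
    x = x.permute(4, 0, 2, 1, 3).contiguous()
    x = x.flatten(1, 2).flatten(2, 3)
    return tuple(x.shape)   # (embed_dim, total_height, total_width)

print(spatial_shape(grid_h, grid_w))  # (8, 24, 48): wide map, matches the wide image
print(spatial_shape(grid_w, grid_h))  # (8, 48, 24): swapped, a tall map for a wide image

With the swapped unpacking the spatial map comes out tall for a wide image, so unpad_image then crops along the wrong axis, which is how the shortest edge ends up smaller than 24.
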
@zucchini-nlp
Contributor Author

Sorry, not resolved. The issue is that in LLaVA-VL, pre-processing assumes the image size is (width, height), which aligns with another assumption that the best resolutions are also (width, height). But then in modeling, the size of an image embedding is assumed to be height-first, which causes confusion and incorrect unpadding.

image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
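
To illustrate what I mean, here is a minimal sketch of the clash between the two conventions; the concrete numbers are hypothetical and only show how a non-square grid ends up transposed:

# All values hypothetical. Pre-processing convention: sizes are (width, height),
# so for a 2:1 landscape image the chosen best resolution is also (width, height).
best_resolution = (672, 336)      # assumed AnyRes pick for a wide image
tile_size = 336

grid = (best_resolution[0] // tile_size, best_resolution[1] // tile_size)  # (2, 1)

# Modeling convention: the tile embeddings are viewed height-first, so the first
# value gets read as the tile count along the image height.
num_patch_height, num_patch_width = grid
print(num_patch_height, num_patch_width)  # 2 1 -- a "tall" grid for a wide image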

So, please let me know whether this is a bug or intended :)
