How to extract 1024 width patch embeddings and CLS embedding #844

alvaro-stylesage · 2024-03-21T17:05:07Z

Hello, I have seen that the encode_image, _encode_image and forward methods all return img_latents and img_embeds with dimension 768, i.e. after the last projection layer. However, the /open_clip/model_configs/coca_ViT-L-14.json file specifies that the width of the vision encoder is 1024. I have two questions:

1. Why is the img_embeds size (1, 255, 768) for one image if there should be 256 patches?
2. How can I get the raw embeddings after the vision encoder of size 1024?

Thanks!


rwightman · 2024-05-08T18:38:44Z

@alvaro-stylesage the CoCa embeds are a bit wrong, see #458 (comment).

It 'works', but it's not 100% correct.
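For anyone landing here with the same two questions, below is a minimal sketch of one way to look at it, not the open_clip implementation. It assumes the vision tower pools the 1024-wide trunk tokens with 256 learned queries into 768-wide outputs and then splits off the first output as the global embedding, which would leave 255 token embeddings. `ToyVisionTower` and `capture_trunk_output` are hypothetical names made up for illustration; only the PyTorch calls (`nn.MultiheadAttention`, `register_forward_hook`) are real. The forward hook is a generic PyTorch way to read the pre-pooling activations at the trunk width.

```python
import torch
import torch.nn as nn


class ToyVisionTower(nn.Module):
    """Hypothetical stand-in for a CoCa-style vision tower (NOT open_clip code)."""

    def __init__(self, width=1024, embed_dim=768, n_queries=256):
        super().__init__()
        # Stand-in for the transformer trunk; tokens stay at `width`.
        self.transformer = nn.Linear(width, width)
        # Learned queries attend over the trunk tokens and emit `embed_dim` features.
        self.query = nn.Parameter(torch.randn(n_queries, embed_dim))
        self.attn = nn.MultiheadAttention(
            embed_dim, num_heads=8, kdim=width, vdim=width, batch_first=True
        )

    def forward(self, tokens):
        x = self.transformer(tokens)  # (B, 257, 1024)
        q = self.query.unsqueeze(0).expand(x.shape[0], -1, -1)
        pooled_all, _ = self.attn(q, x, x)  # (B, 256, 768)
        # Assumed split: first query output is the global embedding,
        # the remaining 255 are returned as token embeddings.
        return pooled_all[:, 0], pooled_all[:, 1:]


def capture_trunk_output(tower, tokens):
    """Run the tower once and also return what the trunk produced before pooling."""
    captured = {}

    def hook(module, inputs, output):
        captured["trunk"] = output.detach()

    handle = tower.transformer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            pooled, token_embeds = tower(tokens)
    finally:
        handle.remove()  # always detach the hook, even if the forward pass fails
    return captured["trunk"], pooled, token_embeds


tower = ToyVisionTower().eval()
# 224 / 14 = 16 patches per side, so 256 patch tokens plus 1 class token = 257.
dummy_tokens = torch.randn(1, 257, 1024)
trunk, pooled, token_embeds = capture_trunk_output(tower, dummy_tokens)
print(tuple(trunk.shape), tuple(pooled.shape), tuple(token_embeds.shape))
```

With a real model the same hook pattern would be registered on whichever submodule holds the transformer trunk. Depending on the library version, that module's output may be laid out sequence-first rather than batch-first, so check the captured shape before indexing the class token and patch tokens.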
