Hello, I have seen that the `encode_image`, `_encode_image`, and `forward` methods all return `img_latents` and `img_embeds` with dimension 768, which means they are taken after the last projection layer. However, the `/open_clip/model_configs/coca_ViT-L-14.json` file specifies that the width of the vision encoder is 1024. I have two questions:
Why is the `img_embeds` size (1, 255, 768) for one image when there should be 256 patches?
How can I get the raw 1024-dimensional embeddings that come out of the vision encoder, before the projection?
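For reference, this is how I arrive at 256 patches. It is only a sketch of the arithmetic, assuming a 224x224 input and a 14x14 patch size (my reading of the ViT-L-14 name; the helper `num_patches` is my own and not part of open_clip):

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size  # patches per side
    return grid * grid

# Assumed values: 224-pixel input, 14-pixel patches.
print(num_patches(224, 14))  # 256
```

So I would expect a sequence length of 256, not 255, in `img_embeds`.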
Thanks!