minimizing difference across image and text features #4
Update: I've discovered ITC from Li et al. and am looking into this loss mechanic. https://arxiv.org/abs/2107.07651
Hi, thanks for your interest in BLIP! The features from the feature extraction example are not suitable for measuring image-text similarity.
I highly recommend you try out the ITM score; you could either use our pre-trained model or the model finetuned on COCO for image-text retrieval. Let me know if you need more help or demo code for the ITM score. Looking forward to seeing BLIP shine for VQGAN!
Thanks Junnan for that explanation! I would certainly use any "text <-> image" scoring example you might have as a reference, but I can also give it a go myself over the next few days. I have a pipeline that makes it easy to "swap out" the perception models (BLIP vs CLIP or SLIP), so it will be very interesting to see visually whether BLIP can capture any finer-grained details from various text descriptions.
I have updated the demo with a new section to compute image-text matching scores, using either the ITM head (w/ cross attention) or the ITC head (feature cosine similarity). Let me know if you have any questions!
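Roughly, the two scoring paths look like the sketch below; the checkpoint path, preprocessing, and exact call signature here are illustrative placeholders reconstructed from memory, so please check the demo notebook for the authoritative version:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from models.blip_itm import blip_itm  # from the BLIP repo

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Assumed checkpoint and input size; the demo notebook has the official values.
model = blip_itm(pretrained='model_base_retrieval_coco.pth', image_size=384, vit='base')
model.eval().to(device)

preprocess = transforms.Compose([
    transforms.Resize((384, 384), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)
caption = 'a blueprint of a steampunk submarine'

with torch.no_grad():
    # ITM head: cross-attention classifier; softmax gives a match probability.
    itm_logits = model(image, caption, match_head='itm')
    itm_score = F.softmax(itm_logits, dim=1)[:, 1]

    # ITC head: cosine similarity between the projected unimodal features.
    itc_score = model(image, caption, match_head='itc')

print('ITM:', itm_score.item(), 'ITC:', itc_score.item())
```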
Hi @dribnet, just curious, does the ITM similarity work for VQGAN?
Yes - we have some preliminary results but need to clean up the code a bit. Will post results here. 👍
We have a version of BLIP loss that we plan on adding to an upcoming release. So far in our testing the BLIP-guided loss works but doesn't "outperform" CLIP on most subjects. However, there can be good effects when combining it with CLIP loss.

For example, we start with a baseline CLIP + diffusion result for "a blueprint of a steampunk submarine". The result with a pure BLIP model (in this case flickr base) isn't generally better, but when we combine the CLIP result with additional BLIP loss, we often do get enhancements over the CLIP-only version.

So, a bit hand-wavy, but those are our first impressions of generating imagery with ITM similarity. I think a lot of this also depends on the dataset - both the subject matter and the formatting of the captions - so perhaps if we reviewed the training sets for these models we could find specific prompts that better match the training distribution.

Thanks again for including this ITM demo in your notebook as a basis for these experiments! Feel free to close the issue if you'd like and I'll follow up when this is released in case anyone wants to try their own prompts out.
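To make the combination concrete: inside the optimization loop it is essentially just a weighted sum of the two guidance terms. The function and weights below are illustrative placeholders, not the code we plan to release:

```python
import torch

def combined_guidance_loss(clip_loss: torch.Tensor,
                           blip_loss: torch.Tensor,
                           clip_weight: float = 1.0,
                           blip_weight: float = 0.5) -> torch.Tensor:
    # Weighted sum of the CLIP and BLIP guidance terms; weights are tuned per prompt.
    return clip_weight * clip_loss + blip_weight * blip_loss
```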
Thanks for the detailed update! I have a few questions:
Looking forward to the release of the code! |
Thanks for the feedback! Glad to work on this a bit more with you if you are interested. In response to your questions:
Finetuned checkpoints:
Hard to read the tea leaves on just one run, but it seems CLIP is doing better with BLIP's input. I can put this online in a web interface so you can try out your own text sentences if you are interested.
One other thing I forgot to mention is that BLIP's larger input size (384x384) is another nice feature relative to CLIP's 224x224 (note some CLIP models do go up to 448, but they are very memory intensive). So doing larger images might also show off some of BLIP's capabilities relative to CLIP. Here's a version of "a blueprint of a steampunk submarine" with BLIP+CLIP that's closer to this resolution, and my hunch is that more fine details emerge thanks to BLIP's input.
Super interesting! After checking your code, it seems that both BLIP's ITC head and ITM head are used, to compute the spherical_distance_itc loss and the itm_loss respectively.
I tested this a bit when I was implementing it. I got the best results when I kept the softmax ITM loss in my final loss too. I decided to optimize the softmax output rather than the raw logits, since the loss doesn't behave that well with unbounded logits and it felt better to "cap" its effect. Mostly I kept it in because I had already spent the time implementing it... :)
https://github.com/samedii/perceptor/blob/master/perceptor/losses/blip.py
I'm getting good results with only BLIP models like this (without any CLIP models involved).
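For reference, the combination is roughly the sketch below; this is a simplified illustration rather than the actual perceptor code, and it assumes the `blip_itm` forward signature from the BLIP demo (`match_head='itm'` / `'itc'`):

```python
import torch
import torch.nn.functional as F

def blip_losses(model, image, caption, itc_weight=1.0, itm_weight=1.0):
    # ITM head: [batch, 2] logits over (no match, match). Softmaxing bounds the
    # loss in [0, 1] instead of optimizing the unbounded logit directly.
    itm_logits = model(image, caption, match_head='itm')
    match_prob = F.softmax(itm_logits, dim=1)[:, 1]
    itm_loss = (1.0 - match_prob).mean()

    # ITC head: cosine similarity between projected image and text features,
    # converted here into a spherical (angular) distance on the unit hypersphere.
    itc_sim = model(image, caption, match_head='itc').diagonal()
    chord = torch.sqrt((2.0 - 2.0 * itc_sim).clamp(min=0.0))
    spherical_distance_itc = chord.div(2).arcsin().pow(2).mul(2).mean()

    return itc_weight * spherical_distance_itc + itm_weight * itm_loss
```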
Interesting! I guess the itm_head without softmax can indeed make optimization difficult because the logit is unbounded. The itc_loss could provide some complementary signal that regularizes the itm_loss. btw, I wrote 'itc' to represent the 'image-text contrastive' loss during pre-training :) |
Very interesting indeed, thanks for letting us try out your work! :)

Yes, I'm also hoping it will sometimes improve things, but it's of course very hard to evaluate whether it's working that way.

I see, then I was fooled! :D Will change that.
Took a stab at creating "BLIP guided imagery" using VQGAN. The general idea is that you start with a reference text embedding and then steer an image to minimize the angle between its embedding and the reference text embedding. I coded this up, but there seemed to be no relationship between the encoded text and the resulting image - for example, the output for "sunset" looked nothing like one.
This leads me to believe that the feature spaces are not aligned as they are in CLIP. This seems to be confirmed with the large model, which has different-sized vectors: the image_features are now of size 1024 while the text features are still of size 768.
Is my assumption correct, and if so, is there a simple transformation or mapping between the feature spaces? One caveat is that I don't fully understand the shape of the returned features, so I am simply extracting the first element as was done in the Feature Extraction example in the demo notebook.