
Larger model #6

Open
yixuan-qiao opened this issue Feb 7, 2022 · 4 comments

@yixuan-qiao

Awesome work, thanks for releasing!

Are there any plans to release larger models in the future, such as BLIP-large or BLIP-xxlarge?

@LiJunnan1992
Contributor

Hi, we have released a larger model which uses ViT-L as the vision encoder (the text encoder is still bert-base). Currently we do not have plans to train models larger than that.

Thanks!
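
For reference, here is a minimal sketch of loading that ViT-L checkpoint for captioning through the repo's `blip_decoder` API. The checkpoint filename is a placeholder (take the actual URL from the README), and the preprocessing mirrors the repo's demo:

```python
import torch
from PIL import Image
from torchvision import transforms
from models.blip import blip_decoder  # from this repo

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Preprocessing matching the BLIP demo: 384px input, CLIP-style normalization.
preprocess = transforms.Compose([
    transforms.Resize((384, 384), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

# vit='large' selects the ViT-L vision encoder; the text side stays bert-base.
CHECKPOINT = 'model_large_caption.pth'  # placeholder -- see the README for the real URL
model = blip_decoder(pretrained=CHECKPOINT, image_size=384, vit='large')
model.eval().to(device)

image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)
with torch.no_grad():
    caption = model.generate(image, sample=False, num_beams=3,
                             max_length=20, min_length=5)[0]
print(caption)
```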

@christophschuhmann

In case you change your mind: we at LAION can provide compute and have 6B as-yet-unreleased image-text pairs, 2.3B of them in English.
(https://laion.ai)

We are currently busy preparing the training of CLIP versions, but we could simply scale up the ViT & LM with the existing code and cooperate on pulling off the training.

Btw, here is a Colab with pretty impressive captioning results I got with BLIP, by generating many candidate captions and filtering them with CLIP ViT-L & ResNet 50x64 (a sketch of the reranking step follows below): https://colab.research.google.com/drive/1fKxiDMa-9uu1A6XiYjxTbYxSagvbZ8Fb?usp=sharing
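
For readers who want the gist of that filtering step without opening the Colab: the idea is to sample many captions from BLIP and keep the one CLIP scores highest against the image. A minimal sketch with OpenAI's `clip` package; it uses a single reranker model (the Colab combines ViT-L/14 and RN50x64), and the sampling parameters are assumptions:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = 'cuda' if torch.cuda.is_available() else 'cpu'
clip_model, clip_preprocess = clip.load('ViT-L/14', device=device)

@torch.no_grad()
def rerank_captions(pil_image, captions):
    """Return candidate captions sorted by CLIP image-text similarity, best first."""
    image = clip_preprocess(pil_image).unsqueeze(0).to(device)
    text = clip.tokenize(captions).to(device)
    img_feat = clip_model.encode_image(image)
    txt_feat = clip_model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)
    order = sims.argsort(descending=True)
    return [(captions[i], sims[i].item()) for i in order]

# Candidates come from BLIP nucleus sampling, e.g. one caption per call:
# candidates = [blip_model.generate(blip_image, sample=True, top_p=0.9,
#                                   max_length=20, min_length=5)[0]
#               for _ in range(16)]
# best_caption, score = rerank_captions(pil_image, candidates)[0]
```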

@LiJunnan1992
Contributor

Hi @christophschuhmann, it would be great if we could cooperate on training larger BLIP models with our code and your data & compute. I am very interested in continuing this discussion.

Thanks for the colab, the captions do look nice!

@christophschuhmann

Awesome! :)

We mostly use Discord for correspondence. My handle is: spirit-from-germany#1488

Here is an invite link to the server we work on:

https://discord.gg/AAwcPAw894

For the image captioning and VQA work, we use the #image-captioning channel.

Let's chat there :)

Btw, here are some VQA results we recently got with a frozen CLIP ViT-L/14 and a frozen GPT-J, with a trained mapping transformer in between:

[Screenshots: example VQA question-answer outputs from the frozen CLIP ViT-L/14 + mapping transformer + frozen GPT-J model]
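
For context, a setup like that typically trains only a small bridging module that maps the frozen CLIP image embedding to a "prefix" of token embeddings fed into the frozen LM (in the spirit of ClipCap). A minimal sketch; all dimensions, the prefix length, and the query-token design are assumptions, not LAION's actual code:

```python
import torch
import torch.nn as nn

class MappingTransformer(nn.Module):
    """Trainable bridge: frozen CLIP image embedding -> prefix for a frozen LM.

    Assumed dims: CLIP ViT-L/14 image embedding is 768-d,
    GPT-J hidden states are 4096-d; prefix_len is a free choice.
    """
    def __init__(self, clip_dim=768, lm_dim=4096, prefix_len=10, depth=4, heads=8):
        super().__init__()
        self.prefix_len = prefix_len
        self.proj_in = nn.Linear(clip_dim, lm_dim)
        # Learned query tokens that will become the LM prefix.
        self.queries = nn.Parameter(torch.randn(prefix_len, lm_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, clip_embed):                       # (B, clip_dim)
        img_tok = self.proj_in(clip_embed).unsqueeze(1)  # (B, 1, lm_dim)
        q = self.queries.unsqueeze(0).expand(clip_embed.size(0), -1, -1)
        # Self-attention over [queries ; image token]; keep the query positions.
        out = self.encoder(torch.cat([q, img_tok], dim=1))
        return out[:, :self.prefix_len]                  # (B, prefix_len, lm_dim)

# Training loop idea: prepend this prefix to the embedded question tokens,
# run the frozen GPT-J, and backprop the answer cross-entropy through the
# mapper only (CLIP and GPT-J parameters stay frozen).
```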
