
Bigger models release? #2

Closed
rom1504 opened this issue Jan 6, 2021 · 19 comments

Comments

@rom1504
Contributor

rom1504 commented Jan 6, 2021

Hi,
Thanks for these amazing results and for releasing the code and ViT-B/32 weights!
Do you plan to also release the 3 bigger models you mention in the paper?

@jongwook
Collaborator

jongwook commented Jan 7, 2021

Hi! We will be releasing the RN50 model soon, but we haven't decided when/whether we will release the other models. I hope to tell you good news in the near future!

@SeanPedersen

Hope to see the vision transformer models included as well!

@ekCSU

ekCSU commented Jan 10, 2021

It really helps the research community (especially those with lower budgets) to be able to try out state-of-the-art ML. CLIP is a simple and elegant idea that many applications and research projects could benefit from, but the smallest released model just does not perform very well. We would greatly appreciate it if OpenAI released the larger models. Thanks.

@thoppe

thoppe commented Jan 11, 2021

Agreed with @ekCSU! I have a bunch of things I'd like to test, namely the power of zero-shot inference... Will the larger models do better with questions like these?

https://twitter.com/metasemantic/status/1348113145609465856
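
A minimal sketch of this kind of zero-shot query, using the released ViT-B/32 and the usage pattern from this repository's README; the image path and candidate captions below are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate captions for a zero-shot query.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    # Probabilities over the candidate captions for this image.
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
```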

@EelcoHoogendoorn

Another upvote. Contrary to what @ekCSU said above, the ViT-B/32 has already been impressive in its ability to generalise to weird domains. Personally I am particularly interested in a pretrained hybrid model, which uses a convolutional backbone but with a transformer rather than a maxpool afterwards. What the paper shows, in my perception, is that you 'can' train SOTA models using transformers alone if you have the compute and data, but not that it's necessarily the most efficient or natural choice. It seems to me that the tiling boundaries in a pure vision transformer must lead to funny/suboptimal behavior at some level. Curious to see how those intuitions play out in my problem domain.

@woctezuma

It sounds to me like: https://github.com/CompVis/taming-transformers

@EelcoHoogendoorn

EelcoHoogendoorn commented Mar 6, 2021

> It sounds to me like: https://github.com/CompVis/taming-transformers

The original 'AN IMAGE IS WORTH 16X16 WORDS' paper investigates them under the label of 'hybrid' models, and if you look at fig. 5, you can see they offer the best performance / training-FLOPS tradeoff. The pure transformer reaches the highest absolute performance, but also with a much bigger compute budget. The experiments I've seen reported give no indication that a hybrid shouldn't be able to keep up if given the bigger compute budget as well. For me the takeaway of the ViT paper isn't 'let's do away with convolutions completely'. Yes, they demonstrate that you can, at least for purposes of classification, which is cool and all from a theoretical point of view; but not that you should.

For most applications, and indeed for generative purposes, a hybrid transformer-convolution architecture seems much more sensible to me. Attention is great and all, but that image tiling mechanism just seems completely unnatural; and one might as well get the benefit of attentive reasoning on the higher-level feature maps. Conv filters do a fine job of detecting edges and assembling them into higher-level features. Transformers would be a great tool on top of that to check whether a cat's ears are actually sitting on top of its head and all that. At least, that's based on theoretical reasoning and the few direct comparisons I've seen, so I could be wrong; but that's why I'd love to try for myself.
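
A rough sketch of the hybrid idea described above, not taken from the paper or this repository: a convolutional backbone produces a feature map whose spatial positions are treated as tokens for a transformer encoder, instead of being globally pooled. All names and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridEncoder(nn.Module):
    """Illustrative hybrid: ResNet feature map -> token sequence -> transformer."""
    def __init__(self, embed_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to (but not including) the global average pool.
        self.stem = nn.Sequential(*list(backbone.children())[:-2])   # B x 2048 x 7 x 7 for 224px input
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 50, embed_dim)) # 49 spatial tokens + CLS
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):                          # x: B x 3 x 224 x 224
        feats = self.proj(self.stem(x))            # B x D x 7 x 7
        tokens = feats.flatten(2).transpose(1, 2)  # B x 49 x D
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.transformer(tokens)[:, 0]      # CLS token as the image embedding

emb = HybridEncoder()(torch.randn(2, 3, 224, 224))
print(emb.shape)  # torch.Size([2, 512])
```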

@EelcoHoogendoorn

The BoTNet paper is also trash-talking pure transformers pretty hard, though I think they mostly demonstrate the same point: that without CLIP-like training, pure transformers lack the data efficiency to be trained well. Sadly, they do not seem to directly address the question of how a hybrid transformer as per the ViT paper actually compares to their shuffling around of the transformer and 1x1/dense layers, as they schematically contrast in fig. 3. Given how obvious a comparison that is, we can safely assume the implication is that the hybrid ViT turns out to be (marginally) superior.

@thomasbkahn

> Hi! We will be releasing the RN50 model soon, but we haven't decided when/whether we will release the other models. I hope to tell you good news in the near future!

Any updates on this? Are you able to share anything about what is factoring into the decision? I would love to try out the larger models.

@fractaldna22

Thank you for releasing ViT-B/16; it seems to be very interesting for guiding image generative models: more coherent, with finer detail and more defined shapes. I would LOVE to try it with ViT-H/14 though.

@fractaldna22

fractaldna22 commented Jul 21, 2021

Is there any advice on how to modify one of the ImageNet-pretrained ViTs to work as a placeholder until it's officially released? Not sure if that's possible, but if it is I would love to know how. I've tried my best, but the naming and module structure of the pretrained ViT from Google and the CLIP model are just too different to get anywhere fast. I can wait for the official CLIP version though.

@jongwook
Collaborator

A similar approach would be to learn a layer or two on top of one of the official ViT models (or any other vision model) to align with CLIP's feature space. No idea how well, or whether better, it would work though. It should be easier and more flexible to directly use one of the pytorch image models (timm) than to retrofit other models into CLIP's existing VisionTransformer class, since tiny implementation details may differ.
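
A rough sketch of that alignment idea, assuming the timm (pytorch-image-models) package: a timm ViT backbone plus a small linear head is trained to reproduce the frozen CLIP image encoder's embeddings on a stream of images. The backbone choice, loss, and learning rate are illustrative assumptions, not a recipe from this thread.

```python
import torch
import torch.nn as nn
import timm
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP image encoder provides the target feature space.
clip_model, clip_preprocess = clip.load("ViT-B/16", device=device)
clip_model.eval()

# Any timm backbone could serve as the student; this choice is illustrative.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).to(device)
head = nn.Linear(backbone.num_features, clip_model.visual.output_dim).to(device)

optimizer = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)

def alignment_step(images):
    """One training step: push the student's features toward CLIP's (cosine loss).

    `images` is a batch of 224x224 tensors on `device`; in practice each model
    expects its own preprocessing/normalization, which is glossed over here.
    """
    with torch.no_grad():
        target = clip_model.encode_image(images).float()
    pred = head(backbone(images))
    loss = 1 - nn.functional.cosine_similarity(pred, target).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```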

@thomasbkahn

FYI to those watching: it looks like two bigger models were released in a recent commit. Thank you @jongwook!


@fractaldna22

fractaldna22 commented Jul 25, 2021

For example, if it looks like a water drop is falling towards a surface, it goes ahead and makes it splash, even though that wasn't necessarily in the prompt. It takes its own creative liberties and makes faces blink, albeit asynchronously between one eye and the other. I assume this is coming from CLIP, but who knows... this is why I'm so interested in what a model with 1 GB of weights will do, having seen so much.

I'm sure it could animate entire sequences and much more than simply "label probs" lol

or-toledano added a commit to or-toledano/CLIP that referenced this issue Aug 9, 2021
jongwook pushed a commit to or-toledano/CLIP that referenced this issue Aug 9, 2021
jongwook added a commit that referenced this issue Aug 9, 2021
* Can specify root directory when loading model

* specifying download_root instead

* Update Prompt_Engineering_for_ImageNet.ipynb

Fix bug caused by changing default to jit=False with handling the case jit=True as well

* Reduce size of diff

* Reduce size of diff #2

* Reduce size of diff #3

* updated Interacting_with_CLIP.ipynb

* update Prompt_Engineering_for_ImageNet.ipynb

Co-authored-by: kcosta42 <[email protected]>
Co-authored-by: Jong Wook Kim <[email protected]>
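
The merged commit above adds a download_root argument to clip.load, so the checkpoint cache location can be chosen explicitly. A quick usage sketch; the path is a placeholder, and by default the weights are cached under ~/.cache/clip:

```python
import clip

# Cache the weights somewhere other than the default location.
model, preprocess = clip.load("ViT-B/32", jit=False, download_root="/path/to/model/cache")
```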
@daboe01

daboe01 commented Mar 1, 2022

https://twitter.com/casualganpapers/status/1490318575873241091

@woctezuma

woctezuma commented Mar 1, 2022

Nice! Thank you for the heads-up!

> OpenAI stealth released the model weights for the largest CLIP models: RN50x64 & ViT-L/14
>
> Change the model name from ViT-B/16 to ViT-L/14 when you load the checkpoint to enjoy this beefed-up version of CLIP!
>
> #MachineLearning #generativeart #vqganclip #generative #AI

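A minimal sketch of what the tweet describes, using this repository's API (which names are available depends on the installed clip version):

```python
import clip

# The newly published checkpoints should show up here, e.g. "RN50x64" and "ViT-L/14".
print(clip.available_models())

# Swap the model name in place of "ViT-B/16".
model, preprocess = clip.load("ViT-L/14")
```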

@jongwook
Collaborator

Fixed in #234
