
Bigger models release? #2

Closed
rom1504 opened this issue Jan 6, 2021 · 19 comments

Comments

@rom1504
Contributor

rom1504 commented Jan 6, 2021

Hi,
Thanks for these amazing results and for releasing the code and ViT-B/32 weights!
Do you plan to also release the 3 bigger models you mention in the paper?

@jongwook
Collaborator

jongwook commented Jan 7, 2021

Hi! We will be releasing the RN50 model soon, but we haven't decided when/whether we will release the other models. I hope to tell you good news in the near future!

@SeanPedersen

Hope to see the vision transformer models included as well!

@ekCSU

ekCSU commented Jan 10, 2021

It really helps the research community (especially those with lower budgets) to be able to try out state-of-the-art ML. CLIP is a simple and elegant idea that many applications and research projects could benefit from, but the smallest released model just does not perform very well. We would greatly appreciate it if OpenAI released the larger models. Thanks.

@thoppe

thoppe commented Jan 11, 2021

Agreed with @ekCSU! I have a bunch of things I'd like to test, namely the power of zero-shot inference... Will the larger models do better with questions like these?

https://twitter.com/metasemantic/status/1348113145609465856
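
A minimal sketch of this kind of zero-shot query, using the released ViT-B/32 and the usage pattern from this repository's README; the image path and candidate captions below are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image and candidate captions for a zero-shot query.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a diagram"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    # Probabilities over the candidate captions for this image.
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
```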

@EelcoHoogendoorn

Another upvote. Contrary to what @ekCSU said above, the ViT-B/32 has already been impressive in its ability to generalise to weird domains. Personally I am particularly interested in a pretrained hybrid model, which uses a convolutional backbone but with a transformer rather than a maxpool afterwards. What the paper shows, in my perception, is that you 'can' train SOTA models using transformers alone if you have the compute and data, but not that it's necessarily the most efficient or natural choice. It seems to me that the tiling boundaries in a pure vision transformer must lead to funny/suboptimal behavior at some level. Curious to see how those intuitions play out in my problem domain.

@woctezuma

It sounds to me like: https://github.com/CompVis/taming-transformers

@EelcoHoogendoorn

EelcoHoogendoorn commented Mar 6, 2021

> It sounds to me like: https://github.com/CompVis/taming-transformers

The original 'AN IMAGE IS WORTH 16X16 WORDS' paper investigates them under the label of 'hybrid' models, and if you look at fig. 5, you can see they offer the best performance / training-FLOPS tradeoff. The pure transformer reaches the highest absolute performance, but also with a much bigger compute budget. The experiments I've seen reported give no indication that a hybrid shouldn't be able to keep up if given the bigger compute budget as well. For me the takeaway of the ViT paper isn't 'let's do away with convolutions completely'. Yes, they demonstrate that you can, at least for purposes of classification, which is cool and all from a theoretical point of view; but not that you should.

For most applications, and indeed for generative purposes, a hybrid transformer-convolution architecture seems much more sensible to me. Attention is great and all, but that image tiling mechanism just seems completely unnatural; and one might as well get the benefit of attentive reasoning on the higher-level feature maps. Conv filters do a fine job of detecting edges and assembling them into higher-level features. Transformers would be a great tool on top of that to check whether a cat's ears are actually sitting on top of its head and all that. At least, that's based on theoretical reasoning and the few direct comparisons I've seen, so I could be wrong; but that's why I'd love to try for myself.
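
A rough sketch of the hybrid idea described above, not taken from the paper or this repository: a convolutional backbone produces a feature map whose spatial positions are treated as tokens for a transformer encoder, instead of being globally pooled. All names and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HybridEncoder(nn.Module):
    """Illustrative hybrid: ResNet feature map -> token sequence -> transformer."""
    def __init__(self, embed_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep everything up to (but not including) the global average pool.
        self.stem = nn.Sequential(*list(backbone.children())[:-2])   # B x 2048 x 7 x 7 for 224px input
        self.proj = nn.Conv2d(2048, embed_dim, kernel_size=1)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 50, embed_dim)) # 49 spatial tokens + CLS
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):                          # x: B x 3 x 224 x 224
        feats = self.proj(self.stem(x))            # B x D x 7 x 7
        tokens = feats.flatten(2).transpose(1, 2)  # B x 49 x D
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.transformer(tokens)[:, 0]      # CLS token as the image embedding

emb = HybridEncoder()(torch.randn(2, 3, 224, 224))
print(emb.shape)  # torch.Size([2, 512])
```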

@EelcoHoogendoorn

The BoTNet paper is also trash-talking pure transformers pretty hard, though I think they mostly demonstrate the same point: that without CLIP-like training, pure transformers lack the data efficiency to be trained well. Sadly, they do not seem to directly address the question of how a hybrid transformer as per the ViT paper actually compares to their shuffling around of the transformer and 1x1/dense layers, as they schematically contrast in fig. 3. Given how obvious a comparison that is, we can safely assume the implication is that the hybrid ViT turns out to be (marginally) superior.

@thomasbkahn

> Hi! We will be releasing the RN50 model soon, but we haven't decided when/whether we will release the other models. I hope to tell you good news in the near future!

Any updates on this? Are you able to share anything about what is factoring into the decision? I would love to try out the larger models.

@fractaldna22

Thank you for releasing ViT-B/16; it seems to be very interesting for guiding image generative models: more coherent, with finer detail and more defined shapes. I would LOVE to try it with ViT-H/14 though.

@fractaldna22

fractaldna22 commented Jul 21, 2021

Is there any advice on how to modify one of the ImageNet-pretrained ViTs to work as a placeholder until it's officially released? Not sure if that's possible, but if it is I would love to know how. I've tried my best, but the naming and module structure of the pretrained ViT from Google and the CLIP model are just too different to get anywhere fast. I can wait for the official CLIP version though.

@jongwook
Collaborator

A similar approach would be to learn a layer or two on top of one of the official ViT models (or any other vision model) to align with CLIP's feature space. No idea how well, or whether better, it would work though. It should be easier and more flexible to directly use one of the pytorch image models (timm) than to retrofit other models into CLIP's existing VisionTransformer class, since tiny implementation details may differ.
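
A rough sketch of that alignment idea, assuming the timm (pytorch-image-models) package: a timm ViT backbone plus a small linear head is trained to reproduce the frozen CLIP image encoder's embeddings on a stream of images. The backbone choice, loss, and learning rate are illustrative assumptions, not a recipe from this thread.

```python
import torch
import torch.nn as nn
import timm
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP image encoder provides the target feature space.
clip_model, clip_preprocess = clip.load("ViT-B/16", device=device)
clip_model.eval()

# Any timm backbone could serve as the student; this choice is illustrative.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).to(device)
head = nn.Linear(backbone.num_features, clip_model.visual.output_dim).to(device)

optimizer = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)

def alignment_step(images):
    """One training step: push the student's features toward CLIP's (cosine loss).

    `images` is a batch of 224x224 tensors on `device`; in practice each model
    expects its own preprocessing/normalization, which is glossed over here.
    """
    with torch.no_grad():
        target = clip_model.encode_image(images).float()
    pred = head(backbone(images))
    loss = 1 - nn.functional.cosine_similarity(pred, target).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```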

@thomasbkahn

FYI to those watching: it looks like two bigger models were released in a recent commit. Thank you @jongwook!


@fractaldna22

fractaldna22 commented Jul 25, 2021

For example, if it looks like a water drop is falling towards a surface, it goes ahead and makes it splash, even though that wasn't necessarily in the prompt. It takes its own creative liberties and makes faces blink, albeit asynchronously between one eye and the other. I assume this is coming from CLIP, but who knows... this is why I'm so interested in what a model with 1 GB of weights will do, having seen so much.

I'm sure it could animate entire sequences and much more than simply "label probs" lol

or-toledano added a commit to or-toledano/CLIP that referenced this issue Aug 9, 2021
jongwook pushed a commit to or-toledano/CLIP that referenced this issue Aug 9, 2021
jongwook added a commit that referenced this issue Aug 9, 2021
* Can specify root directory when loading model

* specifying download_root instead

* Update Prompt_Engineering_for_ImageNet.ipynb

Fix bug caused by changing default to jit=False with handling the case jit=True as well

* Reduce size of diff

* Reduce size of diff #2

* Reduce size of diff #3

* updated Interacting_with_CLIP.ipynb

* update Prompt_Engineering_for_ImageNet.ipynb

Co-authored-by: kcosta42 <[email protected]>
Co-authored-by: Jong Wook Kim <[email protected]>
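
The merged commit above adds a download_root argument to clip.load, so the checkpoint cache location can be chosen explicitly. A quick usage sketch; the path is a placeholder, and by default the weights are cached under ~/.cache/clip:

```python
import clip

# Cache the weights somewhere other than the default location.
model, preprocess = clip.load("ViT-B/32", jit=False, download_root="/path/to/model/cache")
```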
@daboe01

daboe01 commented Mar 1, 2022

https://twitter.com/casualganpapers/status/1490318575873241091

@woctezuma

woctezuma commented Mar 1, 2022

Nice! Thank you for the heads-up!

> OpenAI stealth released the model weights for the largest CLIP models: RN50x64 & ViT-L/14
>
> Change the model name from ViT-B/16 to ViT-L/14 when you load the checkpoint to enjoy this beefed-up version of CLIP!
>
> #MachineLearning #generativeart #vqganclip #generative #AI

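A minimal sketch of what the tweet describes, using this repository's API (which names are available depends on the installed clip version):

```python
import clip

# The newly published checkpoints should show up here, e.g. "RN50x64" and "ViT-L/14".
print(clip.available_models())

# Swap the model name in place of "ViT-B/16".
model, preprocess = clip.load("ViT-L/14")
```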

@jongwook
Collaborator

Fixed in #234
