Inquiry on the "gated cross-modality interaction" #27

Masaaki-75 · 2024-01-19T06:52:27Z

Hi! Thanks for open-sourcing APE, it is fantastic! 👍

I am new to the field of open-vocabulary vision foundation models, and while reading your paper I had a few questions about the "gated cross-modality interaction". I would appreciate your insights on the points below.

I understand that the interaction between image features and text features in GLIP is computationally expensive. However, I could not follow the part about the "all-zero token", quoted here:

Instead, an all-zero token Pzero serves as a special text embedding and inputs to the fusion module for all given vocabularies. In this situation, the fusion process is “static”, as no language information is injected into vision features. The Pzero could provide explicit instructions to recognize primitive concepts and slightly tune vision feature Vvoc and retain original language feature Pvoc.

1. How does it work? That is, how does an all-zero token provide instructions to recognize concepts?
2. From this paragraph, it seems that the token is applied only to word (vocabulary) prompts and is not used for sentence prompts. But in Figure 2, the zero token interacts with sentence prompts. Am I missing something?
3. Where is the corresponding code for Pzero? Is it https://github.com/shenyunhang/APE/blob/main/ape/modeling/ape_deta/deformable_detr_segm.py#L220 ?

shenyunhang · 2024-02-01T05:44:09Z

Sorry for the late response.

1. The all-zero token differs from every other text token because it carries no text information, so the model can become aware that it should perform the OVD and OVS tasks.
2. We only use this token for vocabulary prompts. It can also be used with sentence prompts, but there it has no effect.
3. deformable_detr_segm.py is the no-fusion model; the fusion model is deformable_detr_segm_vl.py. The all-zero token is self.name_prompt_fusion_feature, and the corresponding code is here: https://github.com/shenyunhang/APE/blob/main/ape/modeling/ape_deta/deformable_detr_segm_vl.py#L158
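
For readers who want to see why fusion with an all-zero token is "static", here is a minimal sketch in plain Python. It is not APE's code: `cross_attend`, `gated_fusion`, the `params` dictionary, and the `is_sentence_prompt` flag are hypothetical names, and the fusion module is assumed to be an ordinary single-head cross-attention with bias terms. Under that assumption, a single zero token always receives attention weight 1 and its value projection reduces to the bias, so every vision token gets the same vocabulary-independent offset.

```python
import math


def linear(x, weight, bias):
    """Apply y = W x + b to one vector (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(vision_tokens, text_tokens, params):
    """Single-head cross-attention: vision tokens query the text tokens."""
    keys = [linear(t, params["wk"], params["bk"]) for t in text_tokens]
    values = [linear(t, params["wv"], params["bv"]) for t in text_tokens]
    scale = math.sqrt(len(keys[0]))
    updates = []
    for v in vision_tokens:
        q = linear(v, params["wq"], params["bq"])
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        updates.append([sum(a * val[d] for a, val in zip(attn, values))
                        for d in range(len(values[0]))])
    return updates


def gated_fusion(vision_tokens, text_tokens, params, is_sentence_prompt):
    """Hypothetical gate: sentence prompts fuse with real text features,
    vocabulary prompts fuse with a single all-zero token instead."""
    if not is_sentence_prompt:
        text_dim = len(params["wk"][0])
        text_tokens = [[0.0] * text_dim]  # stand-in for the all-zero token
    updates = cross_attend(vision_tokens, text_tokens, params)
    # Residual connection: the vision features are only nudged by the update.
    return [[x + u for x, u in zip(v, upd)]
            for v, upd in zip(vision_tokens, updates)]


# Toy 2-dimensional parameters (assumed values, for illustration only).
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {
    "wq": identity, "bq": [0.0, 0.0],
    "wk": identity, "bk": [0.0, 0.0],
    "wv": identity, "bv": [0.1, -0.2],
}
vision = [[1.0, 2.0], [0.5, -1.0]]
vocab_a = [[0.3, 0.7]]
vocab_b = [[-1.0, 4.0], [2.0, 2.0]]

static_a = gated_fusion(vision, vocab_a, params, is_sentence_prompt=False)
static_b = gated_fusion(vision, vocab_b, params, is_sentence_prompt=False)
dynamic_a = gated_fusion(vision, vocab_a, params, is_sentence_prompt=True)
dynamic_b = gated_fusion(vision, vocab_b, params, is_sentence_prompt=True)
```

In this toy setup `static_a` and `static_b` are identical even though the vocabularies differ, while `dynamic_a` and `dynamic_b` are not. The text features themselves are never modified on the vocabulary path, which matches the quoted description of slightly tuning the vision feature while retaining the original language feature.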
