
[Question] Why has the performance of d3 improved so much? #1

Open
hhaAndroid opened this issue Dec 6, 2023 · 3 comments

Comments

@hhaAndroid

This is a fantastic job, and I have a question: why has the performance of d3's dataset improved so much? It seems relatively reasonable for other datasets to show improvement. I look forward to your response.

@shenyunhang
Owner

Thanks for your interest in our work.

I think the main reason is that we construct negative queries when training on visual grounding data, as described in the last paragraph of Sec. 3.2.

  1. It compensates for the loss of fine-grained information in sentence-level embedding.
  2. It makes the model learn to reject irrelevant prompts.

We also construct Image-centric Grounding Samples (Sec. 3.3), in which the model learns all objects in an image, described by sentences, simultaneously; this could also improve performance.
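To make the negative-query idea above concrete, here is a minimal sketch (hypothetical helper names, not the repository's actual code): for each image we keep its positive object descriptions and pad the prompt set with descriptions sampled from other images, so some prompts match nothing and receive a zero target.

```python
import random

def build_query_set(positive_descs, corpus_descs, num_negatives, seed=0):
    """Hypothetical sketch: pad an image's positive grounding queries with
    negative descriptions sampled from the rest of the corpus, so the model
    also sees prompts that match nothing in the image."""
    rng = random.Random(seed)
    # Candidate negatives: any corpus description not describing this image.
    pool = [d for d in corpus_descs if d not in positive_descs]
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    queries = positive_descs + negatives
    # Per-query binary target: 1 if the description matches an object in the image.
    targets = [1] * len(positive_descs) + [0] * len(negatives)
    return queries, targets

# Toy usage with a four-description corpus.
pos = ["a red umbrella", "a dog on the grass"]
corpus = ["a red umbrella", "a dog on the grass",
          "a blue car", "a man riding a horse"]
queries, targets = build_query_set(pos, corpus, num_negatives=2)
```

Under this sketch the model sees a mixed prompt list per image, with supervision that explicitly marks the sampled queries as negatives.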

@Masaaki-75

By mentioning "reject" and "negative", do you mean that techniques like contrastive learning are used?

If not, then I am a bit confused. Because, intuitively, concatenating the positive language queries (describing objects in the images) with the negative ones (describing objects that don't exist) and then letting them interact with the visual features seems like introducing noise into the features, right?

Without contrastive loss or other manipulation, how could the model explicitly learn to reject irrelevant prompts, and get higher performance? Please correct me if I am misunderstanding.

@shenyunhang
Owner

We believe the model will learn to denoise as we use noisy tokens for fusion and supervise it with ground-truth signals.

As we formulate grounding as detection, all prompts can be seen as object classes.
When the model is trained in this detection manner, it learns to predict low scores for the negative classes.
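The detection-style supervision described above can be sketched with a per-prompt binary cross-entropy: each (box, prompt) score is pushed toward 1 when that prompt grounds that box and toward 0 otherwise, so negative prompts, which are never positive for any box, are driven to low scores. This is a pure-Python illustration of the loss shape, not the repository's actual implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detection_style_bce(logits, targets):
    """Mean binary cross-entropy over a (num_boxes x num_prompts) score matrix.
    Each prompt acts as an object class; targets[i][j] = 1 iff prompt j
    grounds box i. Negative prompts have all-zero target columns."""
    total, n = 0.0, 0
    for row_l, row_t in zip(logits, targets):
        for l, t in zip(row_l, row_t):
            p = sigmoid(l)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            n += 1
    return total / n

# Toy example: 2 predicted boxes, 3 prompts; the third prompt is a
# negative query that matches neither box, so its target column is all 0.
logits = [[4.0, -3.0, -3.0],
          [-3.0, 4.0, -3.0]]
targets = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
loss = detection_style_bce(logits, targets)
```

A model that scores the negative prompt low everywhere gets a small loss; a model that scores it high is penalized, which is the mechanism by which detection-style training teaches rejection of irrelevant prompts without an explicit contrastive loss.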

3 participants