When taking videos as input #18

Open
hshustc opened this issue Jun 13, 2019 · 3 comments

Comments

hshustc commented Jun 13, 2019

When taking videos as input, the feature maps in each layer have four dimensions, i.e., T×H×W×C. Are the attention maps still query-independent in this case? Could you please give more details? Thanks a lot.
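For concreteness, here is a minimal sketch (assuming a standard embedded-Gaussian non-local block in PyTorch, not this repository's code) of how the attention is formed for 5D video features: the temporal and spatial dimensions are flattened together, so each of the T·H·W query positions owns one attention row over all T·H·W positions, and "query-independent" would mean those rows are nearly identical.

```python
# Minimal sketch of an embedded-Gaussian non-local block for video input.
# This is an illustrative assumption, not the implementation in this repo.
import torch
import torch.nn as nn

class NonLocalVideo(nn.Module):
    """Non-local block over 5D features shaped (N, C, T, H, W)."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.query = nn.Conv3d(channels, reduced, kernel_size=1)
        self.key = nn.Conv3d(channels, reduced, kernel_size=1)
        self.value = nn.Conv3d(channels, reduced, kernel_size=1)
        self.out = nn.Conv3d(reduced, channels, kernel_size=1)

    def forward(self, x):
        n, c, t, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (N, THW, C')
        k = self.key(x).flatten(2)                    # (N, C', THW)
        v = self.value(x).flatten(2).transpose(1, 2)  # (N, THW, C')
        # attn[b, i, j]: how much query position i attends to position j;
        # both i and j range over all T*H*W spatio-temporal positions.
        attn = torch.softmax(q @ k, dim=-1)           # (N, THW, THW)
        y = (attn @ v).transpose(1, 2).reshape(n, -1, t, h, w)
        return x + self.out(y)
```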

tea1528 commented Jul 1, 2019

Upvote for this question. I am really interested in whether the attention maps in the video task show results similar to those in the object detection task.

I would also think the temporal dimension should carry somewhat more importance than the spatial dimensions.

xvjiarui (Owner) commented Jul 3, 2019

Sorry for the late reply.

The attention across time is relatively hard to visualize.

From Table 1 in the paper, the attention on Kinetics seems to be a little more query-dependent than on COCO.
We will leave it as future work.

JJBOY commented Jul 10, 2019

In my experiments on video classification, the non-local module is not query-independent. What about your results?
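For anyone who wants to check this on their own model: a rough sketch (a hypothetical helper, not code from this repository) that scores query-independence as the mean pairwise cosine similarity between attention rows; values near 1 mean the attention map is effectively the same for every query.

```python
# Hypothetical helper to quantify query-independence of one attention map.
import torch
import torch.nn.functional as F

def query_independence(attn: torch.Tensor) -> torch.Tensor:
    """attn: (P, P) softmax attention, P = T*H*W query/key positions."""
    rows = F.normalize(attn, dim=-1)  # unit-normalize each attention row
    sim = rows @ rows.t()             # pairwise cosine similarities
    p = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return off_diag / (p * (p - 1))   # mean similarity over i != j pairs
```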
