When taking videos as input #18

Open
hshustc opened this issue Jun 13, 2019 · 3 comments

Comments

hshustc commented Jun 13, 2019

When taking videos as input, the feature maps in each layer have four dimensions, i.e., T×H×W×C. Are the attention maps still query-independent in this case? Could you please give more details? Thanks a lot.
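For concreteness, here is a minimal sketch (assuming a standard embedded-Gaussian non-local block in PyTorch, not this repository's code) of how the attention is formed for 5D video features: the temporal and spatial dimensions are flattened together, so each of the T·H·W query positions owns one attention row over all T·H·W positions, and "query-independent" would mean those rows are nearly identical.

```python
# Minimal sketch of an embedded-Gaussian non-local block for video input.
# This is an illustrative assumption, not the implementation in this repo.
import torch
import torch.nn as nn

class NonLocalVideo(nn.Module):
    """Non-local block over 5D features shaped (N, C, T, H, W)."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.query = nn.Conv3d(channels, reduced, kernel_size=1)
        self.key = nn.Conv3d(channels, reduced, kernel_size=1)
        self.value = nn.Conv3d(channels, reduced, kernel_size=1)
        self.out = nn.Conv3d(reduced, channels, kernel_size=1)

    def forward(self, x):
        n, c, t, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (N, THW, C')
        k = self.key(x).flatten(2)                    # (N, C', THW)
        v = self.value(x).flatten(2).transpose(1, 2)  # (N, THW, C')
        # attn[b, i, j]: how much query position i attends to position j;
        # both i and j range over all T*H*W spatio-temporal positions.
        attn = torch.softmax(q @ k, dim=-1)           # (N, THW, THW)
        y = (attn @ v).transpose(1, 2).reshape(n, -1, t, h, w)
        return x + self.out(y)
```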

tea1528 commented Jul 1, 2019

Upvote for this question. I am really interested in whether the attention maps in the video task show results similar to those in the object detection task.

I would also think the temporal dimension should carry somewhat more importance than the spatial dimensions.

xvjiarui (Owner) commented Jul 3, 2019

Sorry for the late reply.

The attention across time is relatively hard to visualize.

From Table 1 in the paper, the attention on Kinetics seems to be a little more query-dependent than on COCO.
We will leave it as future work.

JJBOY commented Jul 10, 2019

In my experiments on video classification, the non-local module is not query-independent. What about your results?
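For anyone who wants to check this on their own model: a rough sketch (a hypothetical helper, not code from this repository) that scores query-independence as the mean pairwise cosine similarity between attention rows; values near 1 mean the attention map is effectively the same for every query.

```python
# Hypothetical helper to quantify query-independence of one attention map.
import torch
import torch.nn.functional as F

def query_independence(attn: torch.Tensor) -> torch.Tensor:
    """attn: (P, P) softmax attention, P = T*H*W query/key positions."""
    rows = F.normalize(attn, dim=-1)  # unit-normalize each attention row
    sim = rows @ rows.t()             # pairwise cosine similarities
    p = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()
    return off_diag / (p * (p - 1))   # mean similarity over i != j pairs
```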
