Stars
Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, CVPR 2022
Load and visualize different datasets in video question answering
Code for our CVPR 2020 paper "IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval"
Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization, ACM MM 2020
Code for CVPR 2021 paper: Context-aware Biaffine Localizing Network for Temporal Sentence Grounding
Global-Local Temporal Representations for Video Person Re-Identification
Source code for IJCAI 2020 paper "A Relation-Specific Attention Network for Joint Entity and Relation Extraction"
Code and dataset for ACL 2021 paper "How Knowledge Graph and Attention Help? A Quantitative Analysis into Bag-level Relation Extraction"
Code for CVPR 2019 paper "Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing"
Code for the Scene Graph Generation part of the CVPR 2019 oral paper "Learning to Compose Dynamic Tree Structures for Visual Contexts"
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR 2021)
Spatial-Temporal Transformer for Dynamic Scene Graph Generation, ICCV 2021
Exploring Self-attention for Image Recognition, CVPR 2020
CVPR 2021: Temporal Context Aggregation Network for Temporal Action Proposal Refinement
[CVPR 2021] PyTorch implementation of Probabilistic Modeling of Semantic Ambiguity for Scene Graph Generation
Code for NIPS 2018 paper "Chain of Reasoning for Visual Question Answering"
Learning Long-term Visual Dynamics with Region Proposal Interaction Networks (ICLR 2021)
PyTorch implementation of the Dynamic Concept Learner (ICLR 2021)
PyTorch implementation of "Object Level Visual Reasoning in Videos", F. Baradel, N. Neverova, C. Wolf, J. Mille, G. Mori, ECCV 2018
Visual Coreference Resolution in Visual Dialog using Neural Module Networks
Source code for ACL 2021 paper "From Discourse to Narrative: Knowledge Projection for Event Relation Extraction"
CSC 594: Human-like Visual Question Answering with Multimodal Transformers