| Open Set Action Recognition via Multi-Label Evidential Learning | | | |
| FLAG3D: A 3D Fitness Activity Dataset with Language Instruction | | | |
| MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition | | | |
| The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction | | | |
| Use Your Head: Improving Long-Tail Video Recognition | | | |
| Decomposed Cross-Modal Distillation for RGB-based Temporal Action Detection | ➖ | | |
| Video Test-Time Adaptation for Action Recognition | | | |
| How Can Objects Help Action Recognition? | | | |
| Text-Visual Prompting for Efficient 2D Temporal Video Grounding | | | |
| Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition | | | |
| TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition | | | |
| Learning Video Representations from Large Language Models | | | |
| Fine-tuned CLIP Models are Efficient Video Learners | | | |
| Efficient Movie Scene Detection Using State-Space Transformers | | | |
| AdamsFormer for Spatial Action Localization in the Future | ➖ | | |
| A Light Weight Model for Active Speaker Detection | | | |
| System-Status-Aware Adaptive Network for Online Streaming Video Understanding | ➖ | | |
| STMixer: A One-Stage Sparse Action Detector | | | |
| Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring | | | |
| Distilling Vision-Language Pre-Training To Collaborate With Weakly-Supervised Temporal Action Localization | | | ➖ |
| Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video | | | |
| Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning | | | |
| Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization | | | |
| Learning Discriminative Representations for Skeleton based Action Recognition | | | |
| Learning Procedure-Aware Video Representation from Instructional Videos and their Narrations | | | |
| Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception | | | ➖ |
| PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization | ➖ | | |
| Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization | ➖ | | ➖ |
| Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks | ➖ | | ➖ |
| SVFormer: Semi-Supervised Video Transformer for Action Recognition | | | |
| AutoAD: Movie Description in Context | | | |
| STMT: A Spatial-Temporal Mesh Transformer for MoCap-based Action Recognition | | | |
| Boosting Weakly-Supervised Temporal Action Localization with Text Information | | | |
| Aligning Step-by-Step Instructional Diagrams to Video Demonstrations | | | |
| Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels | | | ➖ |
| Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos | | | |
| Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline | | | |
| LOGO: A Long-Form Video Dataset for Group Action Quality Assessment | | | ➖ |
| Search-Map-Search: A Frame Selection Paradigm for Action Recognition | ➖ | | |
| 3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition | ➖ | | |
| ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding | ➖ | | ➖ |
| Egocentric Video Task Translation | | | |
| Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning | ➖ | | |
| Proposal-based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization | | | |
| TriDet: Temporal Action Detection with Relative Boundary Modeling | | | |
| Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-based Action Recognition | | | |
| EVAL: Explainable Video Anomaly Localization | ➖ | | |
| Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning | | | ➖ |
| Weakly Supervised Temporal Sentence Grounding with Uncertainty-guided Self-Training | ➖ | | |
| Leveraging Temporal Context in Low Representational Power Regimes | | | |
| PIVOT: Prompting for Video Continual Learning | ➖ | | ➖ |
| On the Benefits of 3D Pose and Tracking for Human Action Recognition | | | ➖ |
| NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory | | | ➖ |
| Selective Structured State-Spaces for Long-Form Video Understanding | ➖ | | |
| Frame Flexible Network | | | |
| ASPnet: Action Segmentation with Shared-Private Representation of Multiple Data Sources | ➖ | | ➖ |
| Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling | ➖ | | ➖ |
| Learning Transferable Spatiotemporal Representations from Natural Script Knowledge | | | |
| Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning | | | |
| Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models | | | |
| Procedure-Aware Pretraining for Instructional Video Understanding | | | |
| Latency Matters: Real-Time Action Forecasting Transformer | | | |
| Generating Anomalies for Video Anomaly Detection with Prompt-based Feature Mapping | ➖ | | |
| HierVL: Learning Hierarchical Video-Language Embeddings | | | |
| Two-Stream Networks for Weakly-Supervised Temporal Action Localization with Semantic-Aware Mechanisms | ➖ | | |
| Hybrid Active Learning via Deep Clustering for Video Action Detection | | | |
| Prompt-guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features | ➖ | | ➖ |
| Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection | | | |
| VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking | | | ➖ |
| PDPP: Projected Diffusion for Procedure Planning in Instructional Videos | | | |
| Learning Action Changes by Measuring Verb-Adverb Textual Relationships | | | |
| Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation | ➖ | | |
| Video Event Restoration based on Keyframes for Video Anomaly Detection | ➖ | | |
| Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition | ➖ | | |
| Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting | | | ➖ |
| Post-Processing Temporal Action Detection | | | |
| Relational Space-Time Query in Long-Form Videos | ➖ | | |
| Therbligs in Action: Video Understanding through Motion Primitives | ➖ | | ➖ |
| Dual-Path Adaptation from Image to Video Transformers | | | ➖ |
| Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection | | | ➖ |
| Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection | | | |
| Unbiased Scene Graph Generation in Videos | | | |