
CNM: Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining

In this work, we study the problem of video moment localization with a natural language query and propose a novel weakly supervised solution by introducing Contrastive Negative sample Mining (CNM). Specifically, we use a learnable Gaussian mask to generate positive samples, highlighting the video frames most related to the query, and treat the other frames of the video and the whole video as easy and hard negative samples, respectively. We then train our network with an Intra-Video Contrastive loss to make the positive and negative samples more discriminative.
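
The following PyTorch snippet sketches the core idea (a minimal illustration, not the released implementation; the function name and the width parameterisation are assumptions): a Gaussian mask over the frame axis, with a learnable center and width, weights the frames that form the positive sample; the remaining frames form the easy negative sample, and the whole (unmasked) video serves as the hard negative sample.

import torch

def gaussian_mask(center, width, num_frames):
    # Gaussian weighting over normalized frame positions in [0, 1]
    t = torch.linspace(0, 1, num_frames)
    return torch.exp(-((t - center) ** 2) / (2 * width ** 2 + 1e-8))

num_frames = 64
center = torch.tensor(0.4)  # predicted by the network in practice
width = torch.tensor(0.1)   # predicted by the network in practice
pos_weights = gaussian_mask(center, width, num_frames)  # positive sample: query-related frames
easy_neg_weights = 1.0 - pos_weights                    # easy negative: the remaining frames
hard_neg_weights = torch.ones(num_frames)               # hard negative: the whole video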

Our paper was accepted at AAAI 2022. [Paper] [Project Page]

Pipeline

[Pipeline figure]

Main Results

ActivityNet Captions Dataset

| IoU=0.1 | IoU=0.3 | IoU=0.5 | mIoU  | url   | feature |
|---------|---------|---------|-------|-------|---------|
| 78.13   | 55.68   | 33.33   | 37.14 | model | CLIP    |
| 79.74   | 54.61   | 30.26   | 36.59 | model | C3D     |

Charades-STA Dataset

| IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU  | url   | feature |
|---------|---------|---------|-------|-------|---------|
| 60.04   | 35.15   | 14.95   | 38.11 | model | I3D     |

Requirements

  • pytorch
  • h5py
  • nltk
  • fairseq
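
The dependencies can be installed with pip, for example (a suggested command; versions are not pinned in this README, and the PyPI package name for pytorch is torch):

pip install torch h5py nltk fairseq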

Quick Start

Data Preparation

Please download the visual features from here and save them to the data/ folder. We expect the directory structure to be the following:

data
├── activitynet
│   ├── clip_vit_32_features.hdf5
│   ├── glove.pkl
│   ├── train_data.json
│   ├── val_data.json
│   ├── test_data.json
├── charades
│   ├── i3d_features.hdf5
│   ├── glove.pkl
│   ├── train.json
│   ├── test.json

We extract the CLIP feature every 8 frames for the ActivityNet Captions dataset. For the Charades-STA dataset, we use the I3D feature provided by LGI and use this script to convert the file format to HDF5. We also provide results for training with the C3D feature, whose performance is slightly lower than that of the CLIP feature. If you would like to use the C3D feature, please download it from here and save it as data/activitynet/c3d_features.hdf5.
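
To sanity-check the downloaded features, the HDF5 files can be inspected with h5py (a minimal sketch; we assume each key is a video ID mapping to a per-frame feature array):

import h5py

# Print the first video ID and the shape of its feature matrix
# (expected to be roughly (num_frames, feature_dim)).
with h5py.File("data/activitynet/clip_vit_32_features.hdf5", "r") as f:
    video_id = next(iter(f.keys()))
    print(video_id, f[video_id].shape)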

Training

To train on the ActivityNet Captions dataset:

# With CLIP feature
python train.py --config-path config/activitynet/clip_feature.json --log_dir LOG_DIR --tag TAG
# With C3D feature
python train.py --config-path config/activitynet/c3d_feature.json --log_dir LOG_DIR --tag TAG

To train on the Charades-STA dataset:

python train.py --config-path config/charades/i3d_features.json --log_dir LOG_DIR --tag TAG

Use --log_dir to specify the directory where the logs are saved, and use --tag to identify each experiment. They are both optional.

The model weights are saved in checkpoints/ by default; the save directory can be changed in the configuration file.

Inference

Our trained models are provided in checkpoints/. Run the following command for evaluation:

python train.py --config-path CONFIG_FILE --resume CHECKPOINT_FILE --eval

The configuration file is the same as the one used for training.

Acknowledgements

We appreciate SCN for its implementation of the semantic completion network.
