
CNM: Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining

In this work, we study the problem of video moment localization with a natural language query and propose a novel weakly supervised solution by introducing Contrastive Negative sample Mining (CNM). Specifically, we use a learnable Gaussian mask to generate positive samples, highlighting the video frames most related to the query, and treat the other frames of the video and the whole video as easy and hard negative samples, respectively. We then train our network with an Intra-Video Contrastive loss to make the positive and negative samples more discriminative.
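
The following PyTorch snippet sketches the core idea (a minimal illustration, not the released implementation; the function name and the width parameterisation are assumptions): a Gaussian mask over the frame axis, with a learnable center and width, weights the frames that form the positive sample; the remaining frames form the easy negative sample, and the whole (unmasked) video serves as the hard negative sample.

import torch

def gaussian_mask(center, width, num_frames):
    # Gaussian weighting over normalized frame positions in [0, 1]
    t = torch.linspace(0, 1, num_frames)
    return torch.exp(-((t - center) ** 2) / (2 * width ** 2 + 1e-8))

num_frames = 64
center = torch.tensor(0.4)  # predicted by the network in practice
width = torch.tensor(0.1)   # predicted by the network in practice
pos_weights = gaussian_mask(center, width, num_frames)  # positive sample: query-related frames
easy_neg_weights = 1.0 - pos_weights                    # easy negative: the remaining frames
hard_neg_weights = torch.ones(num_frames)               # hard negative: the whole video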

Our paper was accepted at AAAI 2022. [Paper] [Project Page]

Pipeline

[Pipeline figure]

Main Results

ActivityNet Captions Dataset

| IoU=0.1 | IoU=0.3 | IoU=0.5 | mIoU  | url   | feature |
|---------|---------|---------|-------|-------|---------|
| 78.13   | 55.68   | 33.33   | 37.14 | model | CLIP    |
| 79.74   | 54.61   | 30.26   | 36.59 | model | C3D     |

Charades-STA Dataset

| IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU  | url   | feature |
|---------|---------|---------|-------|-------|---------|
| 60.04   | 35.15   | 14.95   | 38.11 | model | I3D     |

Requirements

  • pytorch
  • h5py
  • nltk
  • fairseq
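
The dependencies can be installed with pip, for example (a suggested command; versions are not pinned in this README, and the PyPI package name for pytorch is torch):

pip install torch h5py nltk fairseq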

Quick Start

Data Preparation

Please download the visual features from here and save them to the data/ folder. We expect the directory structure to be the following:

data
├── activitynet
│   ├── clip_vit_32_features.hdf5
│   ├── glove.pkl
│   ├── train_data.json
│   ├── val_data.json
│   ├── test_data.json
├── charades
│   ├── i3d_features.hdf5
│   ├── glove.pkl
│   ├── train.json
│   ├── test.json

We extract the CLIP feature every 8 frames for the ActivityNet Captions dataset. For the Charades-STA dataset, we use the I3D feature provided by LGI and use this script to convert the file format to HDF5. We also provide results for training with the C3D feature, whose performance is slightly lower than that of the CLIP feature. If you would like to use the C3D feature, please download it from here and save it as data/activitynet/c3d_features.hdf5.
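
To sanity-check the downloaded features, the HDF5 files can be inspected with h5py (a minimal sketch; we assume each key is a video ID mapping to a per-frame feature array):

import h5py

# Print the first video ID and the shape of its feature matrix
# (expected to be roughly (num_frames, feature_dim)).
with h5py.File("data/activitynet/clip_vit_32_features.hdf5", "r") as f:
    video_id = next(iter(f.keys()))
    print(video_id, f[video_id].shape)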

Training

To train on the ActivityNet Captions dataset:

# With CLIP feature
python train.py --config-path config/activitynet/clip_feature.json --log_dir LOG_DIR --tag TAG
# With C3D feature
python train.py --config-path config/activitynet/c3d_feature.json --log_dir LOG_DIR --tag TAG

To train on the Charades-STA dataset:

python train.py --config-path config/charades/i3d_features.json --log_dir LOG_DIR --tag TAG

Use --log_dir to specify the directory where the logs are saved, and use --tag to identify each experiment. They are both optional.

The model weights are saved in checkpoints/ by default; the save directory can be changed in the configuration file.

Inference

Our trained models are provided in checkpoints/. Run the following command for evaluation:

python train.py --config-path CONFIG_FILE --resume CHECKPOINT_FILE --eval

The configuration file is the same as the one used for training.

Acknowledgements

We appreciate SCN for its implementation of the semantic completion network.
