Code for Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning published at AAAI 2023.
PyTorch Implementation of the paper:
Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning
Xian Zhong, Zipeng Li, Shuqin Chen, Kui Jiang, Chen Chen, Mang Ye.
Paper: https://ojs.aaai.org/index.php/AAAI/article/view/25484
- Emphasizes token imbalance to refine the semantics used for video captioning.
- Learns infrequent tokens through a diffusion process to alleviate the long-tailed problem.
- Balances tokens of different frequencies by leveraging distinctive semantics.
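To make the token imbalance concrete, the snippet below counts token occurrences in a handful of captions and splits the vocabulary into frequent and infrequent tokens. It is only an illustration: the captions are invented, and `token_frequencies`, `split_by_frequency`, and the `threshold` value are hypothetical names and settings that do not come from this repository.

```python
from collections import Counter


def token_frequencies(captions):
    """Count how often each whitespace-separated token occurs across captions."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    return counts


def split_by_frequency(counts, threshold):
    """Partition the vocabulary into frequent and infrequent tokens.

    `threshold` is a hypothetical cut-off chosen for illustration only.
    """
    frequent = {token for token, count in counts.items() if count >= threshold}
    infrequent = set(counts) - frequent
    return frequent, infrequent


# Toy captions invented for this example (not taken from MSR-VTT or MSVD).
captions = [
    "a man is playing a guitar",
    "a man is cooking",
    "a woman is playing a violin",
]

counts = token_frequencies(captions)
frequent, infrequent = split_by_frequency(counts, threshold=2)
print(sorted(frequent))
print(sorted(infrequent))
```

Even in this tiny corpus, function words dominate while content words such as "guitar" or "violin" appear once, which is the kind of long-tailed distribution the bullets above refer to.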
As shown in Fig. 2, our overall framework follows the encoder-decoder structure. During training, Frequency-Aware Diffusion (FAD) adds noise to low-frequency tokens so that the model learns their semantics. The diffused token features are then fused with the corresponding visual features through the cross-attention mechanism. At the head of the decoder, the Divergent Semantic Supervisor (DSS) obtains distinctive semantic features by updating the gradient that adapts to the token itself. In the testing phase, only the original Transformer architecture is retained to generate captions.
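The idea of perturbing only low-frequency tokens during training can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the FAD module from the paper or this repository: the Gaussian form of the noise, the `sigma` value, and the names `diffuse_low_frequency` and `low_freq_tokens` are all hypothetical.

```python
import random


def diffuse_low_frequency(embeddings, tokens, low_freq_tokens,
                          sigma=0.1, training=True, rng=None):
    """Return a copy of `embeddings` with noise added at low-frequency tokens.

    Assumption: Gaussian noise with standard deviation `sigma`; the actual
    noise schedule used by FAD may differ. When `training` is False the
    embeddings are returned unchanged, mirroring the statement that only the
    original Transformer is kept at test time.
    """
    rng = rng or random.Random()
    noisy = []
    for token, vector in zip(tokens, embeddings):
        if training and token in low_freq_tokens:
            # Perturb every dimension of the low-frequency token's embedding.
            noisy.append([x + rng.gauss(0.0, sigma) for x in vector])
        else:
            # Frequent tokens (and all tokens at test time) stay as they are.
            noisy.append(list(vector))
    return noisy


# Toy inputs invented for this example.
tokens = ["a", "man", "plays", "ukulele"]
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
low_freq_tokens = {"ukulele"}

train_out = diffuse_low_frequency(embeddings, tokens, low_freq_tokens,
                                  rng=random.Random(0))
test_out = diffuse_low_frequency(embeddings, tokens, low_freq_tokens,
                                 training=False)
```

In this sketch only the embedding of "ukulele" changes in `train_out`, while `test_out` is identical to the input.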
conda create -n RSFD python==3.7
conda activate RSFD
pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
git clone https://github.com/lzp870/RSFD.git
cd RSFD
Organize the corpora and extracted features under VC_data/ as shown below.
The files are available on BaiduYun (extraction code: RSFD).
VC_data
├── MSRVTT
│   ├── feats
│   │   ├── image_resnet101_imagenet_fps_max60.hdf5
│   │   └── motion_resnext101_kinetics_duration16_overlap8.hdf5
│   ├── info_corpus.pkl
│   └── refs.pkl
└── Youtube2Text
    ├── feats
    │   ├── image_resnet101_imagenet_fps_max60.hdf
    │   └── motion_resnext101_kinetics_duration16_overlap8.hdf5
    ├── info_corpus.pkl
    └── refs.pkl
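Before training, it can help to confirm that the data directory matches the layout above. The following is a minimal sketch of such a check; `EXPECTED` and `missing_files` are hypothetical names that are not part of this repository, and the file names are copied verbatim from the tree, so adjust them if your downloaded files differ.

```python
from pathlib import Path

# Relative paths copied verbatim from the directory tree in this README.
EXPECTED = {
    "MSRVTT": [
        "feats/image_resnet101_imagenet_fps_max60.hdf5",
        "feats/motion_resnext101_kinetics_duration16_overlap8.hdf5",
        "info_corpus.pkl",
        "refs.pkl",
    ],
    "Youtube2Text": [
        "feats/image_resnet101_imagenet_fps_max60.hdf",
        "feats/motion_resnext101_kinetics_duration16_overlap8.hdf5",
        "info_corpus.pkl",
        "refs.pkl",
    ],
}


def missing_files(root, expected=EXPECTED):
    """Return the expected files (relative to `root`) that do not exist."""
    root = Path(root)
    return [
        f"{dataset}/{rel}"
        for dataset, rels in expected.items()
        for rel in rels
        if not (root / dataset / rel).is_file()
    ]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains VC_data/.
    for path in missing_files("VC_data"):
        print("missing:", path)
```

An empty result means every file listed in the tree was found.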
python train.py --default --dataset MSRVTT --method ARB
python train.py --default --dataset MSVD --method ARB
python translate.py --default --dataset MSRVTT --method ARB
python translate.py --default --dataset MSVD --method ARB
If our research and this repository are helpful to your work, please star this repo and cite:
@inproceedings{DBLP:conf/aaai/ZhongLCJ0Y23,
author = {Xian Zhong and
Zipeng Li and
Shuqin Chen and
Kui Jiang and
Chen Chen and
Mang Ye},
editor = {Brian Williams and
Yiling Chen and
Jennifer Neville},
title = {Refined Semantic Enhancement towards Frequency Diffusion for Video
Captioning},
booktitle = {Thirty-Seventh {AAAI} Conference on Artificial Intelligence, {AAAI}
2023, Thirty-Fifth Conference on Innovative Applications of Artificial
Intelligence, {IAAI} 2023, Thirteenth Symposium on Educational Advances
in Artificial Intelligence, {EAAI} 2023, Washington, DC, USA, February
7-14, 2023},
pages = {3724--3732},
publisher = {{AAAI} Press},
year = {2023},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/25484},
timestamp = {Sun, 30 Jul 2023 19:22:30 +0200},
biburl = {https://dblp.org/rec/conf/aaai/ZhongLCJ0Y23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
The encoder code is based on yangbang18/Non-Autoregressive-Video-Captioning.