Our code is based on the XL-Sum codebase built on Hugging Face Transformers. We used the following environment:
python==3.7.9
pytorch==1.7.1
torchvision==0.8.2
torchaudio==0.7.2
cudatoolkit=10.2
The visual feature extraction code is mainly taken from image_feature_extraction [1,2].
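Purely as an illustration of region-level feature extraction, the sketch below uses torchvision's off-the-shelf Faster R-CNN rather than the Visual Genome detector of [1,2]; the model choice, region count, and returned quantities are assumptions, not the repository's actual extractor.

```python
# Illustrative only: [1,2] use a Faster R-CNN trained on Visual Genome that exports
# ROI-pooled 2048-d features (typically 36 regions per image); this sketch just
# shows the general region-detection step with torchvision.
import torch
import torchvision
from PIL import Image
from torchvision import transforms

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

def extract_regions(image_path, max_regions=36):
    """Return the top-scoring bounding boxes and scores for one image."""
    image = transforms.ToTensor()(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = detector([image])[0]
    keep = output["scores"].argsort(descending=True)[:max_regions]
    return output["boxes"][keep], output["scores"][keep]
```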
The code for incorporating image features is mainly borrowed from VG-GPLMs.
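For intuition, here is a minimal sketch of that kind of text-vision fusion (cross-modal attention followed by a gate, in the spirit of VG-GPLMs); the class name, dimensions, and tensor layout are illustrative assumptions, not the repository's actual modules.

```python
import torch
import torch.nn as nn

class GatedVisualFusion(nn.Module):
    """Sketch: inject region features into text encoder states via cross-attention + gate."""

    def __init__(self, text_dim=768, visual_dim=2048, num_heads=8):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, text_dim)   # project regions into text space
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads)
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_states, visual_feats):
        # text_states: (src_len, batch, text_dim); visual_feats: (n_regions, batch, visual_dim)
        vis = self.visual_proj(visual_feats)
        attended, _ = self.cross_attn(query=text_states, key=vis, value=vis)
        gate = torch.sigmoid(self.gate(torch.cat([text_states, attended], dim=-1)))
        return text_states + gate * attended  # gated residual fusion
```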
All triplet data <image URLs, article, summary> can be downloaded here. Note that under the zero-shot setting, the training data of the zero-shot languages is not used.
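For reference, the sketch below shows how such a triplet file might be read and its images fetched; the JSON-lines layout and the field names (image_urls, text, summary) are assumptions about the released data, not guaranteed by it.

```python
import json
import os
import urllib.request

def load_triplets(path):
    """Yield (image_urls, article, summary) triplets from a JSON-lines file (assumed format)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["image_urls"], record["text"], record["summary"]

def download_images(urls, out_dir):
    """Fetch an article's images so visual features can be extracted locally."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(urls):
        urllib.request.urlretrieve(url, os.path.join(out_dir, f"{i}.jpg"))
```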
For multi-GPU multilingual training (8 GPUs), run:
bash multimodal_dist_mmt5_32_ft.sh 4 11 high-resource 1.0 8 256 # the high-resource setting reproduces Table 1
For single-GPU, single-language training, run:
bash single_lang_multimodal_train32.sh high-resource english # e.g., training on the English dataset
For testing, run:
bash evaluate.sh
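The evaluation script reports ROUGE-style summarization metrics. As a rough sketch of that computation, the snippet below uses the rouge_score package; the actual evaluate.sh may rely on a multilingual ROUGE variant with different tokenization and options.

```python
from rouge_score import rouge_scorer

def average_rouge(predictions, references):
    """Average ROUGE-1/2/L F1 over paired system and reference summaries (sketch)."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for pred, ref in zip(predictions, references):
        scores = scorer.score(ref, pred)  # signature: score(target, prediction)
        for key in totals:
            totals[key] += scores[key].fmeasure
    n = max(len(predictions), 1)
    return {key: value / n for key, value in totals.items()}
```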
[1] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. Unifying vision-and-language tasks via text generation. In ICML, 2021: 1931-1942.
[2] Peter Anderson, Xiaodong He, Chris Buehler, et al. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018: 6077-6086.
If you find this work useful, please cite:

@misc{https://doi.org/10.48550/arxiv.2212.07672,
  title = {Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization},
  author = {Liang, Yunlong and Meng, Fandong and Xu, Jinan and Wang, Jiaan and Chen, Yufeng and Zhou, Jie},
  doi = {10.48550/ARXIV.2212.07672},
  url = {https://arxiv.org/abs/2212.07672},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), FOS: Computer and information sciences},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}