
[TCSVT23] Official code for "SPT: Spatial Pyramid Transformer for Image Captioning".


SPT: Spatial Pyramid Transformer for Image Captioning

[Paper] | TCSVT 2023

This is the code implementation of the paper "SPT: Spatial Pyramid Transformer for Image Captioning". The checkpoints and features will be released soon.

Overview

Canonical approaches to image captioning tend to use vision transformers for sentence generation. These methods typically treat visual representation modeling of an image as a sequential problem (i.e., flattening image patches) and achieve impressive performance. However, the spatial semantic loss incurred by flattening image grid features has received little attention to date. Moreover, current transformer models routinely maintain a full-length patch sequence during training and inference, which lacks hierarchical representation and makes it difficult to generate sentences at multiple levels of granularity. To this end, we propose the Spatial Pyramid Transformer (SPT), which progressively pools vision patches to shrink the sequence length for caption generation with varying granularity among image grids.


Figure 1. Overview of the Spatial Pyramid Transformer (SPT) for Image Captioning.
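The progressive pooling idea described above can be illustrated with a minimal sketch: a flattened grid of patch features is repeatedly 2x2 average-pooled, halving each spatial side per level and shrinking the sequence length. This is a hypothetical NumPy illustration of the general mechanism, not the official SPT implementation (the pooling operator, schedule, and feature shapes here are assumptions).

```python
import numpy as np

def pyramid_pool(grid_feats, levels=2):
    """Progressively 2x2 average-pool a flattened grid of patch features,
    shrinking the sequence length at each pyramid level.

    NOTE: illustrative sketch only, not the paper's actual pooling module.

    grid_feats: array of shape (H*W, d), with H == W divisible by 2**levels.
    Returns a list of flattened feature sequences, one per level.
    """
    n, d = grid_feats.shape
    side = int(np.sqrt(n))
    assert side * side == n, "expected a square patch grid"
    feats = grid_feats.reshape(side, side, d)
    pyramid = [grid_feats]
    for _ in range(levels):
        h, w, _ = feats.shape
        # 2x2 average pooling over the spatial grid halves each side
        feats = feats.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))
        pyramid.append(feats.reshape(-1, d))
    return pyramid

# Example: an 8x8 grid of 512-d features shrinks to 4x4, then 2x2
pyr = pyramid_pool(np.random.rand(64, 512), levels=2)
print([p.shape[0] for p in pyr])  # → [64, 16, 4]
```

Each shorter sequence gives the decoder a coarser spatial view of the image, which is the intuition behind generating captions at multiple levels of granularity.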

Dataset and Training Details

Note

For the data preparation, feature download, and training details, please refer to this Repo.
