
[TCSVT23] Official code for "SPT: Spatial Pyramid Transformer for Image Captioning".


SPT: Spatial Pyramid Transformer for Image Captioning

[Paper] | TCSVT 2023

This is the code implementation of the paper "SPT: Spatial Pyramid Transformer for Image Captioning". The checkpoints and features will be released soon.

Overview

Canonical approaches to image captioning tend to use vision transformers for sentence generation. These methods typically treat visual representation modeling of an image as a sequential problem (i.e., flattening image patches) and achieve impressive performance. However, the spatial semantic loss incurred by flattening image grid features has received little attention to date. Moreover, current transformer models routinely maintain a full-length patch sequence during training and inference, which lacks hierarchical representation and makes it difficult to generate sentences at multiple levels of granularity. To this end, we propose the Spatial Pyramid Transformer (SPT), which progressively pools vision patches to shrink the sequence length for caption generation with varying granularity among image grids.


Figure 1. Overview of the Spatial Pyramid Transformer (SPT) for Image Captioning.
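The progressive pooling idea described above can be illustrated with a minimal sketch: a flattened grid of patch features is repeatedly 2x2 average-pooled, halving each spatial side per level and shrinking the sequence length. This is a hypothetical NumPy illustration of the general mechanism, not the official SPT implementation (the pooling operator, schedule, and feature shapes here are assumptions).

```python
import numpy as np

def pyramid_pool(grid_feats, levels=2):
    """Progressively 2x2 average-pool a flattened grid of patch features,
    shrinking the sequence length at each pyramid level.

    NOTE: illustrative sketch only, not the paper's actual pooling module.

    grid_feats: array of shape (H*W, d), with H == W divisible by 2**levels.
    Returns a list of flattened feature sequences, one per level.
    """
    n, d = grid_feats.shape
    side = int(np.sqrt(n))
    assert side * side == n, "expected a square patch grid"
    feats = grid_feats.reshape(side, side, d)
    pyramid = [grid_feats]
    for _ in range(levels):
        h, w, _ = feats.shape
        # 2x2 average pooling over the spatial grid halves each side
        feats = feats.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))
        pyramid.append(feats.reshape(-1, d))
    return pyramid

# Example: an 8x8 grid of 512-d features shrinks to 4x4, then 2x2
pyr = pyramid_pool(np.random.rand(64, 512), levels=2)
print([p.shape[0] for p in pyr])  # → [64, 16, 4]
```

Each shorter sequence gives the decoder a coarser spatial view of the image, which is the intuition behind generating captions at multiple levels of granularity.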

Dataset and Training Details

Note

For the data preparation, feature download, and training details, please refer to this Repo.
