Vision Transformers for Dense Prediction #242

brunosan · 2024-04-30T09:58:13Z

brunosan
Apr 30, 2024
Maintainer

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional
networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer
into image-like representations at various resolutions and progressively combine them into full-resolution predictions
using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high
resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art.
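
To make the "tokens into image-like representations" step more concrete, here is a rough NumPy sketch of the idea. It is not the authors' implementation. The function names, the choice to drop a leading readout (class) token, the nearest-neighbour and mean-pool resampling, and the additive fusion are all assumptions made for illustration; the abstract only says that a convolutional decoder combines the maps.

```python
import numpy as np


def tokens_to_feature_map(tokens, image_hw, patch_size, has_readout_token=True):
    """Reshape a (num_tokens, dim) sequence into a (dim, H/p, W/p) map."""
    grid_h, grid_w = image_hw[0] // patch_size, image_hw[1] // patch_size
    if has_readout_token:
        # Assumption: the readout token comes first and is simply dropped.
        tokens = tokens[1:]
    if tokens.shape[0] != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")
    # Row-major patch order: (grid_h, grid_w, dim) -> (dim, grid_h, grid_w).
    return tokens.reshape(grid_h, grid_w, -1).transpose(2, 0, 1)


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling (stand-in for a learned upsampling layer)."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)


def downsample_mean(fmap, factor):
    """Mean pooling (stand-in for a learned strided convolution)."""
    dim, h, w = fmap.shape
    blocks = fmap.reshape(dim, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))


def fuse_coarse_to_fine(maps):
    """Combine maps ordered coarse to fine, each 2x the previous resolution."""
    fused = maps[0]
    for finer in maps[1:]:
        # Plain addition stands in for the convolutional fusion blocks.
        fused = upsample_nearest(fused, 2) + finer
    return fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_hw, patch_size, dim = (64, 64), 16, 8
    num_tokens = 1 + (64 // 16) ** 2  # readout token + 4x4 patch tokens

    # Hypothetical token outputs taken from four transformer stages.
    stage_tokens = [rng.normal(size=(num_tokens, dim)) for _ in range(4)]
    base = [tokens_to_feature_map(t, image_hw, patch_size) for t in stage_tokens]

    # Every stage has the same token resolution, so the pyramid is built by
    # resampling: earlier stages go to finer grids, the last to a coarser one.
    pyramid = [
        downsample_mean(base[3], 2),   # 2x2
        base[2],                       # 4x4
        upsample_nearest(base[1], 2),  # 8x8
        upsample_nearest(base[0], 4),  # 16x16
    ]
    print(fuse_coarse_to_fine(pyramid).shape)
```

Because the backbone keeps a constant token resolution at every stage, the multi-scale pyramid here comes entirely from resampling the reassembled maps rather than from the backbone itself, which is the main structural difference from a convolutional encoder.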
