
VisionTransformer (PyTorch)

A complete, easy-to-follow PyTorch implementation of Google's Vision Transformer, as proposed in "An Image Is Worth 16x16 Words". The code is commented throughout for better understanding.

Original paper: "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale" (https://arxiv.org/abs/2010.11929).

  • This PyTorch implementation is based on this repo, but follows the original paper more closely in the first patch embedding and the weight initializations. The default dataset is CIFAR10, which can easily be swapped for ImageNet or anything else.
  • You might need to install einops (pip install einops).
  • According to the paper, training from scratch may not match the accuracy of state-of-the-art CNNs such as ResNet; pretrain on a larger dataset to exploit the full potential of the Vision Transformer.
  • Easy-to-understand comments are included in the code, especially helpful for beginners in attention and transformer models.
  • The standalone script "Google_ViT" is sufficient to run this code; a minimal sketch of the paper's patch-embedding step is shown below.
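For orientation, here is a minimal sketch of the linear patch embedding the paper describes: split the image into p x p patches, flatten each, project with a single learned linear layer, prepend a class token, and add learned position embeddings. The names used here (PatchEmbedding, embed_dim, and friends) and the CIFAR10-sized defaults are illustrative assumptions, not taken from the Google_ViT script:

```python
# Minimal, self-contained sketch of ViT-style patch embedding (assumed names,
# not the repo's actual code). Requires torch and einops.
import torch
import torch.nn as nn
from einops import rearrange


class PatchEmbedding(nn.Module):
    """Flatten p x p image patches and project each to a D-dim token with one
    learned linear layer, then prepend a [CLS] token and add learned position
    embeddings, as in the ViT paper."""

    def __init__(self, image_size=32, patch_size=4, in_channels=3, embed_dim=128):
        super().__init__()
        assert image_size % patch_size == 0, "image size must be divisible by patch size"
        num_patches = (image_size // patch_size) ** 2
        patch_dim = in_channels * patch_size * patch_size
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, embed_dim)            # linear patch projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)        # paper-style initialization

    def forward(self, x):                                      # x: (B, C, H, W)
        p = self.patch_size
        # (B, C, H, W) -> (B, num_patches, C * p * p): cut into patches and flatten each
        x = rearrange(x, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=p, p2=p)
        x = self.proj(x)                                       # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)         # one class token per image
        x = torch.cat([cls, x], dim=1)                         # prepend [CLS]
        return x + self.pos_embed                              # add position embeddings


if __name__ == "__main__":
    tokens = PatchEmbedding()(torch.randn(2, 3, 32, 32))
    print(tokens.shape)  # torch.Size([2, 65, 128]): 64 patches + 1 class token
```

The resulting token sequence is what gets fed to the transformer encoder; only the [CLS] token's final state is used for classification.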
