MiscellaneousStuff/vision-transformer
Vision Transformer

About

Re-implementation of the Vision Transformer (ViT)

TODO

  • Image Preprocessing
  • Patching
  • Flattening
  • Linear Projection
  • Position Embedding
  • CLS Token
  • Transformer
    • Attention
    • Feedforward
  • MLP Head Classifier
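The preprocessing steps in the TODO list above (patching, flattening, linear projection, CLS token, position embedding) can be sketched end to end. This is a minimal numpy illustration, not the repository's code; all sizes (32x32 RGB image, 8x8 patches, embed dim 64) are hypothetical:

```python
import numpy as np

# Hypothetical sizes: a 32x32 RGB image, 8x8 patches, embed dim 64.
H, W, C, P, D = 32, 32, 3, 8, 64
N = (H * W) // (P * P)               # number of patches = 16

rng = np.random.default_rng(0)
img = rng.standard_normal((H, W, C))

# Patching + flattening: (H, W, C) -> (N, P*P*C)
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)

# Linear projection to D dimensions (a trainable weight in a real model)
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_proj            # (N, D)

# Prepend a learnable CLS token, then add position embeddings
cls = np.zeros((1, D))               # learnable in a real model
pos = rng.standard_normal((N + 1, D)) * 0.02
seq = np.concatenate([cls, tokens], axis=0) + pos   # (N+1, D)
print(seq.shape)                     # (17, 64)
```

The resulting sequence of N+1 embedded tokens is what the Transformer encoder consumes.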

Datasets

  • MNIST (98%)
  • CIFAR-10 (75.25%)
  • Tiny ImageNet (44%) (same parameter count as ResNet-18)

Model

Architecture

  • Patch + Position Embedding (plus an extra learnable class embedding)
  • Linear Projection of Flattened Patches
  • Transformer Encoder
  • MLP Head (contains GELU)
  • Class output (e.g. bird, ball, car)
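The final stage above, the MLP head, maps the encoder's class token to class probabilities. A minimal numpy sketch, assuming hypothetical sizes (embed dim 64, hidden dim 128, 3 classes) and a two-layer head with GELU as noted in the list:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
D, hidden, n_classes = 64, 128, 3    # hypothetical sizes; 3 classes as above

seq = rng.standard_normal((17, D))   # stand-in for encoder output, CLS first
cls = seq[0]                         # only the class token feeds the head

W1 = rng.standard_normal((D, hidden)) * 0.02
W2 = rng.standard_normal((hidden, n_classes)) * 0.02
logits = gelu(cls @ W1) @ W2
probs = softmax(logits)              # one probability per class
```

Only the CLS token's final representation is classified; the other patch tokens are discarded at this stage.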

Transformer Encoder

  • Embedded Patches
  • Layer Norm
  • Multi-Head Attention (MHA)
  • Add (residual connection)
  • Layer Norm
  • MLP
  • Add (residual connection)
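The encoder steps above form the standard pre-norm ViT block: each sub-layer is applied to a layer-normalized input and added back to the residual stream. A minimal numpy sketch (single attention head for brevity; a real block uses several):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def encoder_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # Pre-norm self-attention with residual (single head for brevity)
    h = layer_norm(x)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = x + attn @ Wo
    # Pre-norm MLP with residual
    h = layer_norm(x)
    x = x + gelu(h @ W1) @ W2
    return x

rng = np.random.default_rng(0)
D, hidden = 64, 256                  # hypothetical embed and MLP sizes
x = rng.standard_normal((17, D))     # stand-in for the embedded patches
Ws = [rng.standard_normal(s) * 0.02 for s in
      [(D, D), (D, D), (D, D), (D, D), (D, hidden), (hidden, D)]]
out = encoder_block(x, *Ws)
print(out.shape)                     # (17, 64)
```

Because the output shape matches the input shape, blocks like this can be stacked to any depth.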

Image Handling

  • Reshape the image x ∈ R^(H×W×C) into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses a constant latent vector size D through all of its layers, so the patches are flattened and mapped to D dimensions with a trainable linear projection; the outputs of this projection are the patch embeddings.
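The arithmetic above is easy to check with concrete numbers. Assuming the common ViT-Base setting (a 224x224 RGB image with 16x16 patches, figures not taken from this repository):

```python
# Patch-count arithmetic for a hypothetical 224x224 RGB image
# split into 16x16 patches (the common ViT-Base setting):
H, W, C, P = 224, 224, 3, 16
N = (H * W) // (P * P)      # 196 patches = effective sequence length
flat_dim = P * P * C        # 768 values per flattened patch
print(N, flat_dim)          # 196 768
```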
