Vision Transformer (ViT)

TensorFlow implementation of the Vision Transformer (ViT) presented in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", where the authors show that Transformers applied directly to sequences of image patches and pre-trained on large datasets perform very well on image classification.
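
To make the "image patches" idea concrete, the sketch below works out how many patches a square image yields and how many values each flattened patch holds. The function name `patch_sequence_shape` and the example sizes are illustrative assumptions, not taken from this repository's code.

```python
def patch_sequence_shape(image_size, patch_size, channels=3):
    """Return (num_patches, patch_dim) for a square image cut into square patches.

    Hypothetical helper for illustration only; it is not part of this repository.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2               # length of the patch sequence
    patch_dim = patch_size * patch_size * channels    # values in one flattened patch
    return num_patches, patch_dim


# A 224x224 RGB image with 16x16 patches: 14 patches per side.
print(patch_sequence_shape(224, 16))  # (196, 768)

# A 32x32 RGB image with 4x4 patches: 8 patches per side.
print(patch_sequence_shape(32, 4))    # (64, 48)
```

Each flattened patch is then linearly projected to the model's embedding size. The paper also prepends a learnable classification token, which would add one to the sequence length.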

Install dependencies

Create a Python 3 virtual environment and activate it:

virtualenv -p python3 venv
source ./venv/bin/activate

Next, install the required dependencies:

pip install -r requirements.txt

Train model

Start the model training by running:

python train.py --logdir path/to/log/dir

To track metrics, start TensorBoard:

tensorboard --logdir path/to/log/dir

Then open localhost:6006 in a browser.

Citation

@inproceedings{
    anonymous2021an,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Anonymous},
    booktitle={Submitted to International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=YicbFdNTTy},
    note={under review}
}
