Transformer Models In Unsupervised Semantic Segmentation

The purpose of this repository is to explore the application of the Vision Transformer to a variety of semantic segmentation tasks. We first examine a common convolutional method for unsupervised image segmentation, and then both supervised and unsupervised transformer-based segmentation methods. All credit goes to the original authors; we have done our best to cite all code and ideas borrowed from them.

The three architectures that we explore are:

WNet - A Fully Convolutional Method of Unsupervised Segmentation
Segmenter - Supervised Segmentation powered by Transformers
DINO - Unsupervised Attention Segmentation via Contrastive Learning

Training Script

# Training for Segmenter Model
python -m train --model=segmenter --batch-size=16 --epochs=30 \
    --learning-rate=0.001 --pretrained --save-model --save-logs

# Training for WNet
python -m train --model=wnet --batch-size=16 --epochs=30 \
    --learning-rate=0.001 --pretrained --save-model --save-logs
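
For readers who want to see how the flags in these commands fit together, here is a minimal sketch of an argument parser that accepts them. It is an illustration only: it assumes the training entry point uses Python's `argparse`, and the function name `build_parser`, the default values, and the restriction of `--model` to the two models shown are assumptions, not a description of the actual `train.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the commands above."""
    parser = argparse.ArgumentParser(prog="train")
    # Assumption: only the two models shown in the commands are selectable.
    parser.add_argument("--model", choices=["segmenter", "wnet"], required=True)
    # Assumption: defaults copy the values used in the example commands.
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=30)
    parser.add_argument("--learning-rate", type=float, default=1e-3)
    # Boolean switches: absent means False, present means True.
    parser.add_argument("--pretrained", action="store_true")
    parser.add_argument("--save-model", action="store_true")
    parser.add_argument("--save-logs", action="store_true")
    return parser


# Parse the same flags as the Segmenter command shown above.
args = build_parser().parse_args(
    [
        "--model=segmenter",
        "--batch-size=16",
        "--epochs=30",
        "--learning-rate=0.001",
        "--pretrained",
        "--save-model",
        "--save-logs",
    ]
)
print(args.model, args.batch_size, args.epochs, args.learning_rate)
```

Note that `argparse` converts dashes in flag names to underscores, so `--batch-size` is read back as `args.batch_size`.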

Visualizations

WNet

W-Net results on the MP4 dataset

Link to the model: https://drive.google.com/file/d/1sM6D0k04HJw3UQoLRsJDSBcgb1AU9Vwl/view?usp=sharing

W-Net results on the reconstruction task

W-Net results on the ADE20K dataset

Segmenter

"Tiny" Segmenter results on the ADE20K dataset (patch size: 16x16, token size: 192)

"Standard" Segmenter results on the ADE20K dataset (patch size: 8x8, token size: 768)
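
To make the two configurations above concrete, the following sketch counts how many patch tokens each patch size produces for a square input. The 512x512 input resolution and the helper name `num_patch_tokens` are assumptions chosen for illustration; the resolution actually used for these results is not stated here.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


IMAGE_SIZE = 512  # assumption: illustrative input resolution only

# (name, patch size, token size) taken from the two captions above.
configs = [("Tiny", 16, 192), ("Standard", 8, 768)]

for name, patch_size, token_size in configs:
    tokens = num_patch_tokens(IMAGE_SIZE, patch_size)
    print(f"{name}: {tokens} tokens, each of size {token_size}")
```

Halving the patch size quadruples the number of tokens, and because self-attention compares every token with every other token, the number of attention pairs grows by a factor of sixteen. This is one reason the smaller patch size is considerably more expensive to train and run.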

DINO

Attention maps generated by DINO

References

For WNet, Segmenter, and DINO, we credit the following papers and GitHub repositories:

@misc{xia2017w,
  title={W-net: A deep model for fully unsupervised image segmentation},
  author={Xia, Xide and Kulis, Brian},
  journal={arXiv preprint arXiv:1711.08506},
  year={2017}
}

@misc{strudel2021segmenter,
      title={Segmenter: Transformer for Semantic Segmentation}, 
      author={Robin Strudel and Ricardo Garcia and Ivan Laptev and Cordelia Schmid},
      year={2021},
      eprint={2105.05633},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{caron2021emerging,
      title={Emerging Properties in Self-Supervised Vision Transformers}, 
      author={Mathilde Caron and Hugo Touvron and Ishan Misra and Hervé Jégou and Julien Mairal and Piotr Bojanowski and Armand Joulin},
      year={2021},
      eprint={2104.14294},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{fkodom,
  author={Frank Odom},
  title={wnet-unsupervised-image-segmentation},
  year={2019},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/fkodom/wnet-unsupervised-image-segmentation}}
}

@misc{taoroalin,
  author={Tao Lin},
  title={WNet},
  year={2018},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/taoroalin/WNet}}
}

We also thank the following resource, which provides the pre-trained transformer model used in our Segmenter implementation:

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}

Finally, we acknowledge the dataset we used to train and evaluate our implementations:

@inproceedings{zhou2017ade20k,
  title={Scene parsing through ade20k dataset},
  author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={633--641},
  year={2017}
}

@article{zhou2019ade20k,
  title={Semantic understanding of scenes through the ade20k dataset},
  author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Xiao, Tete and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
  journal={International Journal of Computer Vision},
  volume={127},
  number={3},
  pages={302--321},
  year={2019},
  publisher={Springer}
}
