tue-mps/benchmark-vfm-ss

Code for "How to Benchmark Vision Foundation Models for Semantic Segmentation?" (CVPR 2024 Second Workshop on Foundation Models)

Getting started

  1. Download datasets. You only need the datasets you intend to use.

  2. Environment setup.

    conda create -n benchmark-vfm-ss python=3.10
    conda activate benchmark-vfm-ss

  3. Install required packages.

    pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu123

    (replace cu123 with the tag for your CUDA version if it is not 12.3).

  4. Fine-tune a model. Here's an example for fine-tuning DINOv2 on ADE20K with the default setup on GPU 0 with 1 worker for data loading:

    python main.py fit -c configs/ade20k_linear_semantic.yaml --root /data --data.num_workers 1 --trainer.devices [0] --model.network.encoder_name vit_base_patch14_dinov2

    (replace /data with the folder where you stored the datasets)
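
Step 3 hard-codes the wheel tag for CUDA 12.3. As a minimal sketch, the helper below derives the tag and the install command from a `major.minor` CUDA version string. The function names are hypothetical and not part of this repository, and the code does not check whether wheels are actually published for a given tag, so verify the resulting URL before using it.

```python
def cuda_tag(version: str) -> str:
    """Turn a CUDA version such as '12.3' into a wheel tag such as 'cu123'."""
    # Expects at least 'major.minor'; any further components are ignored.
    major, minor = version.strip().split(".")[:2]
    return f"cu{int(major)}{int(minor)}"


def pip_install_command(version: str, requirements: str = "requirements.txt") -> str:
    """Build the step-3 install command for the given CUDA version."""
    index_url = f"https://download.pytorch.org/whl/{cuda_tag(version)}"
    return f"pip install -r {requirements} --extra-index-url {index_url}"


if __name__ == "__main__":
    # Reproduces the command from step 3 for CUDA 12.3.
    print(pip_install_command("12.3"))
```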

Reproducing results from the paper

For the commands below, add --root with the path where the datasets and checkpoints are stored, and --data.num_workers with the number of workers for data loading.
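
Since every command below needs the same two flags, it can help to assemble the argument list in one place. The following is a rough sketch: `fit_command` is a hypothetical helper, not something this repository provides, and it only uses flags that appear in this README.

```python
import shlex


def fit_command(config, encoder_name, root, num_workers, extra_flags=()):
    """Assemble a `python main.py fit` command as an argument list."""
    args = [
        "python", "main.py", "fit",
        "-c", config,
        "--model.network.encoder_name", encoder_name,
        "--root", str(root),
        "--data.num_workers", str(num_workers),
    ]
    # Model-specific flags (if any) are appended unchanged.
    args.extend(extra_flags)
    return args


if __name__ == "__main__":
    cmd = fit_command(
        "configs/ade20k_linear_semantic.yaml",
        "vit_base_patch14_dinov2",
        root="/data",
        num_workers=4,
    )
    # Print a copy-pasteable shell command.
    print(shlex.join(cmd))
```

The returned list can also be passed to `subprocess.run(cmd, check=True)` from the repository root instead of being printed.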

If using the BEiT models, download their checkpoints and convert them to timm format using convert_beit_ckpt.ipynb.

Please note that:

  • BEiT models need the converted checkpoint described above, loaded with --model.network.ckpt_path, and apply layer normalization slightly differently, so the architecture is adjusted with --model.network.sub_norm.
  • EVA02 models show significantly lower mIoU when using torch.compile, for reasons not yet identified, so compilation is turned off with --no_compile.
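
The two notes above can be captured as a small lookup when scripting the runs. This is only a sketch: `extra_flags` is a hypothetical helper, and detecting EVA02 by the `eva02_` name prefix is an assumption based on the encoder names listed in this README. The BEiT runs use a generic `vit_*` encoder name, so the caller has to pass the checkpoint path explicitly.

```python
def extra_flags(encoder_name, beit_ckpt_path=None):
    """Return the model-specific flags described in the notes above."""
    flags = []
    # Assumption: EVA02 encoders are recognisable by their name prefix.
    if encoder_name.startswith("eva02_"):
        flags.append("--no_compile")
    # BEiT cannot be inferred from the encoder name, so it is opt-in.
    if beit_ckpt_path is not None:
        flags += [
            "--model.network.ckpt_path", beit_ckpt_path,
            "--model.network.sub_norm", "True",
        ]
    return flags


if __name__ == "__main__":
    print(extra_flags("eva02_base_patch14_224.mim_in22k"))
    print(extra_flags("vit_base_patch16_224", "beit3_base_patch16_224.pth.timm"))
    print(extra_flags("vit_base_patch14_dinov2"))
```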

Default setup:

  1. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name eva02_base_patch16_clip_224.merged2b --no_compile
  2. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name eva02_base_patch14_224.mim_in22k --no_compile
  3. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch14_dinov2
  4. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224 --model.network.ckpt_path beit3_base_patch16_224.pth.timm --model.network.sub_norm True
  5. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_siglip_512.webli
  6. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_clip_224.dfn2b
  7. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in22k_ft_in1k
  8. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in1k
  9. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224.mae
  10. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name samvit_base_patch16.sa1b

Freezing the encoder:

  1. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name eva02_base_patch16_clip_224.merged2b --no_compile --model.freeze_encoder True
  2. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name eva02_base_patch14_224.mim_in22k --no_compile --model.freeze_encoder True
  3. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch14_dinov2 --model.freeze_encoder True
  4. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224 --model.network.ckpt_path beit3_base_patch16_224.pth.timm --model.network.sub_norm True --model.freeze_encoder True
  5. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_siglip_512.webli --model.freeze_encoder True
  6. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_clip_224.dfn2b --model.freeze_encoder True
  7. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in22k_ft_in1k --model.freeze_encoder True
  8. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in1k --model.freeze_encoder True
  9. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224.mae --model.freeze_encoder True
  10. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name samvit_base_patch16.sa1b --model.freeze_encoder True

Changing the decoder:

  1. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name eva02_base_patch16_clip_224.merged2b --no_compile
  2. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name eva02_base_patch14_224.mim_in22k --no_compile
  3. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name vit_base_patch14_dinov2
  4. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name vit_base_patch16_224 --model.network.ckpt_path beit3_base_patch16_224.pth.timm --model.network.sub_norm True
  5. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name vit_base_patch16_siglip_512.webli
  6. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name vit_base_patch16_clip_224.dfn2b
  7. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in22k_ft_in1k
  8. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in1k
  9. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name vit_base_patch16_224.mae
  10. python main.py fit -c configs/ade20k_mask2former_semantic.yaml --model.network.encoder_name samvit_base_patch16.sa1b

Scaling the model:

  1. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name eva02_large_patch14_clip_336.merged2b --no_compile
  2. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name eva02_large_patch14_224.mim_m38m --no_compile
  3. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_large_patch14_dinov2
  4. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_large_patch16_224 --model.network.ckpt_path beit3_large_patch16_224.pth.timm --model.network.sub_norm True
  5. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_large_patch16_siglip_384.webli
  6. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_large_patch14_clip_224.dfn2b
  7. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name deit3_large_patch16_384.fb_in22k_ft_in1k
  8. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name deit3_large_patch16_384.fb_in1k
  9. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_large_patch16_224.mae
  10. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name samvit_large_patch16.sa1b

Varying the patch size:

  1. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name eva02_base_patch16_clip_224.merged2b --no_compile --model.network.patch_size 8
  2. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name eva02_base_patch14_224.mim_in22k --no_compile --model.network.patch_size 8
  3. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch14_dinov2 --model.network.patch_size 8
  4. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224 --model.network.ckpt_path beit3_base_patch16_224.pth.timm --model.network.sub_norm True --model.network.patch_size 8
  5. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_siglip_512.webli --model.network.patch_size 8
  6. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_clip_224.dfn2b --model.network.patch_size 8
  7. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in22k_ft_in1k --model.network.patch_size 8
  8. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in1k --model.network.patch_size 8
  9. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224.mae --model.network.patch_size 8
  10. python main.py fit -c configs/ade20k_linear_semantic.yaml --model.network.encoder_name samvit_base_patch16.sa1b --model.network.patch_size 8
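
To get a feel for what a smaller patch size costs, the sketch below counts the tokens a ViT-style encoder would process at different patch sizes. It assumes a square input and non-overlapping patches; the 512-pixel image size is chosen purely for illustration, and the actual input size is determined by the config.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


if __name__ == "__main__":
    image_size = 512  # assumed for illustration only
    for patch_size in (16, 14, 8):
        print(f"patch {patch_size}: {num_patches(image_size, patch_size)} tokens")
```

Under these assumptions, going from patch size 16 to 8 raises the token count from 1024 to 4096, a fourfold increase.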

Changing the downstream dataset (PASCAL VOC):

  1. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name eva02_base_patch16_clip_224.merged2b --no_compile
  2. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name eva02_base_patch14_224.mim_in22k --no_compile
  3. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name vit_base_patch14_dinov2
  4. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224 --model.network.ckpt_path beit3_base_patch16_224.pth.timm --model.network.sub_norm True
  5. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_siglip_512.webli
  6. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_clip_224.dfn2b
  7. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in22k_ft_in1k
  8. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in1k
  9. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224.mae
  10. python main.py fit -c configs/pascal_voc_linear_semantic.yaml --model.network.encoder_name samvit_base_patch16.sa1b

Changing the downstream dataset (Cityscapes):

  1. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name eva02_base_patch16_clip_224.merged2b --no_compile
  2. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name eva02_base_patch14_224.mim_in22k --no_compile
  3. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name vit_base_patch14_dinov2
  4. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224 --model.network.ckpt_path beit3_base_patch16_224.pth.timm --model.network.sub_norm True
  5. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_siglip_512.webli
  6. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_clip_224.dfn2b
  7. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in22k_ft_in1k
  8. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in1k
  9. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224.mae
  10. python main.py fit -c configs/cityscapes_linear_semantic.yaml --model.network.encoder_name samvit_base_patch16.sa1b

Introducing a domain shift:

  1. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name eva02_base_patch16_clip_224.merged2b --no_compile
  2. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name eva02_base_patch14_224.mim_in22k --no_compile
  3. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name vit_base_patch14_dinov2
  4. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224 --model.network.ckpt_path beit3_base_patch16_224.pth.timm --model.network.sub_norm True
  5. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_siglip_512.webli
  6. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_clip_224.dfn2b
  7. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in22k_ft_in1k
  8. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name deit3_base_patch16_384.fb_in1k
  9. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name vit_base_patch16_224.mae
  10. python main.py fit -c configs/gta5_linear_semantic.yaml --model.network.encoder_name samvit_base_patch16.sa1b

Acknowledgement

We borrow some code from Hugging Face Transformers (https://github.com/huggingface/transformers), which is licensed under Apache-2.0.