I'm noticing very subpar performance from SimMIM on my task compared to MAE, and this also seems to be an issue on the Imagenette benchmarks. I was wondering what might be causing this, and whether we'd still see performance issues with a non-ViT backbone. Is it possible to use backbones like convnets and Swin transformers with the current implementation of SimMIM? I'm curious how you'd need to change the forward_encoder method to do so, and whether images_to_tokens could be generalized to other backbones.
```python
def forward_encoder(self, images, batch_size, idx_mask):
    # Pass all tokens to the encoder, both masked and unmasked ones.
    tokens = self.backbone.images_to_tokens(images, prepend_class_token=True)
    tokens_masked = utils.mask_at_index(tokens, idx_mask, self.mask_token)
    return self.backbone.encoder(tokens_masked)
```
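To sketch what a non-ViT variant might look like: `images_to_tokens` assumes a ViT-style patch embedding, so it doesn't transfer directly to convnets or Swin. One backbone-agnostic option is to apply the mask token in pixel space before the backbone sees the image (the SimMIM paper instead applies it to patch embeddings right after Swin's patch-embedding layer, but pixel-level masking keeps the sketch general). The helper name `mask_image_patches` and the shapes below are illustrative, not part of the library:

```python
import torch

def mask_image_patches(images, idx_mask, patch_size, mask_token):
    """Replace masked patches with a mask token in pixel space.

    images:     (B, C, H, W) batch of images.
    idx_mask:   (B, K) flat patch indices to mask, row-major over the patch grid.
    mask_token: tensor broadcastable to (C, patch_size, patch_size),
                e.g. an nn.Parameter of shape (C, 1, 1).
    """
    B, C, H, W = images.shape
    patches_per_row = W // patch_size
    out = images.clone()
    for b in range(B):
        for idx in idx_mask[b].tolist():
            r = (idx // patches_per_row) * patch_size
            c = (idx % patches_per_row) * patch_size
            out[b, :, r:r + patch_size, c:c + patch_size] = mask_token
    return out

# forward_encoder for a convnet/Swin backbone would then reduce to something like:
# def forward_encoder(self, images, batch_size, idx_mask):
#     images_masked = mask_image_patches(images, idx_mask, self.patch_size, self.mask_token)
#     return self.backbone(images_masked)
```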
Unlike MAE's transformer decoder, SimMIM only uses a lightweight prediction head, but I guess a simple linear layer might be enough in our setup.
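For context, such a linear reconstruction head would look roughly like this; the names and dimensions are illustrative, not the library's API:

```python
import torch.nn as nn

embed_dim = 768   # assumed ViT-B token dimension
patch_size = 16   # assumed patch size

# A single linear layer maps each encoded token back to its patch's raw
# pixels (patch_size * patch_size * 3 values), in place of MAE's
# multi-block transformer decoder.
decoder = nn.Linear(embed_dim, patch_size * patch_size * 3)
```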
Also note that we measure performance using KNN, which yields lower scores than linear evaluation or finetuning. ViT-based architectures generally require finetuning for good performance.
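For reference, KNN evaluation amounts to fitting a nearest-neighbor classifier on frozen embeddings. A minimal sketch using scikit-learn (the benchmark's exact settings, e.g. `k` and the distance metric, may differ):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_eval(train_emb, train_labels, test_emb, test_labels, k=20):
    # L2-normalize embeddings so distances are scale-invariant.
    train_emb = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test_emb = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_emb, train_labels)
    # Accuracy of a majority vote over the k nearest training embeddings.
    return knn.score(test_emb, test_labels)
```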