I'm noticing very subpar performance from SimMIM on my task compared to MAE, and this also seems to be an issue on the Imagenette benchmarks. I was wondering what might be causing this, and whether we'd still see performance issues with a non-ViT backbone. Is it possible to use backbones like convnets and Swin transformers with the current implementation of SimMIM? I'm curious how you'd need to change the forward_encoder method to do so, and whether images_to_tokens could be generalized to other backbones.
```python
def forward_encoder(self, images, batch_size, idx_mask):
    # Pass all tokens to the encoder, both masked and unmasked ones.
    tokens = self.backbone.images_to_tokens(images, prepend_class_token=True)
    tokens_masked = utils.mask_at_index(tokens, idx_mask, self.mask_token)
    return self.backbone.encoder(tokens_masked)
```
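To sketch what a non-ViT variant might look like: `images_to_tokens` assumes a ViT-style patch embedding, so it doesn't transfer directly to convnets or Swin. One backbone-agnostic option is to apply the mask token in pixel space before the backbone sees the image (the SimMIM paper instead applies it to patch embeddings right after Swin's patch-embedding layer, but pixel-level masking keeps the sketch general). The helper name `mask_image_patches` and the shapes below are illustrative, not part of the library:

```python
import torch

def mask_image_patches(images, idx_mask, patch_size, mask_token):
    """Replace masked patches with a mask token in pixel space.

    images:     (B, C, H, W) batch of images.
    idx_mask:   (B, K) flat patch indices to mask, row-major over the patch grid.
    mask_token: tensor broadcastable to (C, patch_size, patch_size),
                e.g. an nn.Parameter of shape (C, 1, 1).
    """
    B, C, H, W = images.shape
    patches_per_row = W // patch_size
    out = images.clone()
    for b in range(B):
        for idx in idx_mask[b].tolist():
            r = (idx // patches_per_row) * patch_size
            c = (idx % patches_per_row) * patch_size
            out[b, :, r:r + patch_size, c:c + patch_size] = mask_token
    return out

# forward_encoder for a convnet/Swin backbone would then reduce to something like:
# def forward_encoder(self, images, batch_size, idx_mask):
#     images_masked = mask_image_patches(images, idx_mask, self.patch_size, self.mask_token)
#     return self.backbone(images_masked)
```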
Unlike MAE's transformer decoder, SimMIM only uses a lightweight prediction head, but I guess a simple linear layer might be enough in our setup.
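For context, such a linear reconstruction head would look roughly like this; the names and dimensions are illustrative, not the library's API:

```python
import torch.nn as nn

embed_dim = 768   # assumed ViT-B token dimension
patch_size = 16   # assumed patch size

# A single linear layer maps each encoded token back to its patch's raw
# pixels (patch_size * patch_size * 3 values), in place of MAE's
# multi-block transformer decoder.
decoder = nn.Linear(embed_dim, patch_size * patch_size * 3)
```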
Also note that we measure performance using KNN, which yields lower scores than linear evaluation or finetuning. ViT-based architectures generally require finetuning for good performance.
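For reference, KNN evaluation amounts to fitting a nearest-neighbor classifier on frozen embeddings. A minimal sketch using scikit-learn (the benchmark's exact settings, e.g. `k` and the distance metric, may differ):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_eval(train_emb, train_labels, test_emb, test_labels, k=20):
    # L2-normalize embeddings so distances are scale-invariant.
    train_emb = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test_emb = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_emb, train_labels)
    # Accuracy of a majority vote over the k nearest training embeddings.
    return knn.score(test_emb, test_labels)
```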