
# CLIP-Mamba: CLIP Pretrained Mamba Models with OOD and Hessian Evaluation

[Paper][🤗Ckpts]

## Abstract

State space models and Mamba-based models have been increasingly applied across various domains and achieve state-of-the-art performance. This technical report presents the first attempt to train a transferable Mamba model with contrastive language-image pretraining (CLIP). We train Mamba models of varying sizes and comprehensively evaluate them on 26 zero-shot classification datasets and 16 out-of-distribution (OOD) datasets. Our findings reveal that a Mamba model with 67 million parameters is on par with a 307 million-parameter Vision Transformer (ViT) on zero-shot classification, highlighting the parameter efficiency of Mamba models. In OOD generalization tests, Mamba-based models perform exceptionally well under shifted image contrast and high-pass filtering. However, a Hessian analysis shows that Mamba models have a sharper and more non-convex loss landscape than ViT-based models, making them harder to train.
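The sharpness claim above comes from a Hessian analysis of the loss landscape. As a minimal sketch of the standard technique, the PyTorch snippet below estimates the largest-magnitude Hessian eigenvalue by power iteration on Hessian-vector products; it illustrates the general approach (the same idea is implemented in tools such as PyHessian), not this repo's exact evaluation code, and `loss_fn`/`params` are placeholder names.

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, n_iters=20):
    """Estimate the largest-magnitude Hessian eigenvalue of the loss
    w.r.t. `params` by power iteration on Hessian-vector products.
    A larger value indicates a sharper loss landscape."""
    loss = loss_fn()
    # Keep the graph so the gradient can be differentiated again.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start from a random unit-norm direction.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    eigenvalue = 0.0
    for _ in range(n_iters):
        # Hessian-vector product: d(grad . v)/d(params) = H v.
        dot = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(dot, params, retain_graph=True)
        # Rayleigh quotient v^T H v (v has unit norm) estimates the eigenvalue.
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return eigenvalue
```

Under this measure, a sharper model yields a larger top eigenvalue than a flatter one at the same loss value.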

## Main results

Zero-shot performance of different architectures trained with CLIP

| Method | Food-101 | CIFAR-10 | CIFAR-100 | CUB | SUN397 | Cars | Aircraft | DTD | Pets | Caltech-101 | Flowers | MNIST | FER-2013 | STL-10 | EuroSAT | RESISC45 | GTSRB | KITTI | Country211 | PCAM | UCF101 | Kinetics700 | CLEVR | HatefulMemes | SST2 | ImageNet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VMamba-B (89M) | 48.5 | 58.0 | 29.9 | 36.5 | 50.4 | 5.8 | 8.5 | 26.5 | 30.2 | 64.7 | 52.8 | 9.7 | 19.6 | 91.9 | 16.0 | 30.4 | 7.9 | 40.2 | 10.2 | 59.9 | 35.2 | 25.6 | 12.6 | 51.6 | 50.1 | 38.3 |
| VMamba-S (50M) | 49.4 | 70.3 | 34.3 | 39.1 | 53.9 | 6.9 | 8.4 | 26.0 | 31.3 | 68.7 | 54.1 | 10.1 | 9.8 | 92.8 | 17.6 | 31.4 | 6.9 | 23.5 | 10.9 | 54.2 | 38.4 | 27.1 | 13.2 | 50.5 | 50.0 | 40.0 |
| VMamba-T220 (30M) | 46.5 | 50.9 | 22.9 | 35.6 | 51.1 | 5.7 | 6.8 | 25.1 | 31.0 | 64.9 | 54.0 | 10.1 | 12.5 | 91.6 | 13.9 | 25.4 | 10.7 | 32.3 | 9.9 | 55.0 | 34.0 | 25.1 | 12.7 | 53.9 | 50.6 | 38.7 |
| SiMBA-L (66.6M) | 52.7 | 67.4 | 31.0 | 39.1 | 52.7 | 6.9 | 9.1 | 27.8 | 33.4 | 68.9 | 55.9 | 8.0 | 16.0 | 93.9 | 17.4 | 32.3 | 8.9 | 41.5 | 11.1 | 58.1 | 35.7 | 27.9 | 12.1 | 54.9 | 50.1 | 41.6 |
| ViT-B (84M) | 50.6 | 66.0 | 34.5 | 38.8 | 51.1 | 4.0 | 5.4 | 21.2 | 28.5 | 60.9 | 53.3 | 8.4 | 17.3 | 90.5 | 30.2 | 21.5 | 6.1 | 35.1 | 10.5 | 53.5 | 28.5 | 22.1 | 10.8 | 52.4 | 50.7 | 37.6 |
| ViT-L (307M) | 59.5 | 72.9 | 41.5 | 40.3 | 53.6 | 6.9 | 6.4 | 20.6 | 27.9 | 65.4 | 55.0 | 10.3 | 34.5 | 94.2 | 22.7 | 28.8 | 5.8 | 41.4 | 12.5 | 54.9 | 34.3 | 24.0 | 12.9 | 54.3 | 50.1 | 40.4 |
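For context on how such zero-shot numbers are obtained: class names are converted into text prompts, images and prompts are embedded and L2-normalized, and cosine similarity between the two serves as the classification logits. The sketch below assumes an `encode_image`/`encode_text`/`tokenize` interface mirroring the open-source CLIP API; it is illustrative, not this repo's exact evaluation code.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, tokenize, images, class_names, device="cuda"):
    # Build one text prompt per class and embed the prompts once.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_feats = model.encode_text(tokenize(prompts).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    # Embed the image batch and L2-normalize, as in standard CLIP.
    image_feats = model.encode_image(images.to(device))
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)

    # Cosine similarity acts as the logits; take the best class per image.
    logits = image_feats @ text_feats.t()
    return logits.argmax(dim=-1)
```

Dataset accuracy is then the fraction of images whose predicted class index matches the label; in practice, multiple prompt templates are usually ensembled per dataset.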

## Acknowledgment

This project builds on A-CLIP (paper, code), VMamba (paper, code), and SiMBA (paper, code). We thank the authors for their excellent work.
