Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning
This is the official code for Variance Alignment Score (VAS), a simple but efficient data selection method for training CLIP models.
If you find this repository, our paper, or our data useful, please consider citing:
@article{wang2024variance,
title={Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning},
author={Wang, Yiping and Chen, Yifang and Yan, Wendan and Jamieson, Kevin and Du, Simon Shaolei},
journal={arXiv preprint arXiv:2402.02055},
year={2024}
}
Data selection has recently emerged as a core issue for large-scale vision-language pretraining such as CLIP, especially on noisy web-curated datasets. For this setting, we design a simple but efficient data selection method named Variance Alignment Score (VAS) filtering. It uses a very simple formula to evaluate the informativeness of each data point, an important factor that quality scores such as CLIP similarity neglect. See our paper for details.
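At a high level (this is an informal sketch of the idea; see the paper for the precise definition and notation), VAS scores a sample x_i by how well its embedding f(x_i) aligns with the variance of a target/proxy distribution:

VAS(x_i) = f(x_i)^T Σ_target f(x_i),   with   Σ_target = (1/|D_target|) Σ_{x ∈ D_target} f(x) f(x)^T,

so samples whose embeddings lie along high-variance directions of the target data are considered more informative, and the final filter combines this score with a CLIP similarity score.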
Please follow DataComp to install the dependencies and download the DataComp datasets. Our code currently supports DataComp-Small (12.8M samples in total, requiring about 528GB of space) and DataComp-Medium (128M samples in total, requiring about 5.28TB of space).
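For example, following the DataComp instructions (check their README for the current flags; the commands below are a sketch and may change upstream):

```bash
# set up DataComp, which provides the downloader, training, and evaluation code
git clone https://github.com/mlfoundations/datacomp.git
cd datacomp
bash create_env.sh

# download the DataComp-Small pool (12.8M samples) as webdataset shards;
# use --scale medium for DataComp-Medium
python download_upstream.py --scale small --data_dir path/to/datasets/datacomp_small/
```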
We can run VAS/VAS-D + CLIP score filtering by executing run_datacomp_small.sh for the DataComp-Small dataset and run_datacomp_medium.sh for the DataComp-Medium dataset. In these files, we recommend setting the path to the DataComp-x dataset as path/to/datasets/datacomp_x/ and the path to the output files as path/to/files/datacomp_x/. Please first specify dataset_path and files_path in these bash files before executing them (a minimal example is sketched below). Besides, our code is also compatible with DataComp and supports running their baselines.
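A minimal sketch of configuring and launching the small-scale filtering run (the exact way these paths are set inside the script may differ; adapt to the examples in the bash files):

```bash
# set the two paths used by run_datacomp_small.sh, e.g.
dataset_path=path/to/datasets/datacomp_small/   # location of the DataComp-Small shards
files_path=path/to/files/datacomp_small/        # where filtered uids / intermediate files are written

# run VAS / VAS-D + CLIP score filtering on the small pool
bash run_datacomp_small.sh
```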
After running the filtering algorithm, we can run run_exp.sh to (1) reshard the training dataset, (2) train the model, and (3) evaluate on 38 downstream tasks. Please first specify num_gpus, files_path, scale, datacomp_scale, and filter_list in run_exp.sh following the examples given inside (an illustrative configuration is sketched below).
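For illustration, a hypothetical DataComp-Small configuration could look like the following; the concrete values (in particular filter_list) are placeholders, so follow the examples given inside run_exp.sh:

```bash
# example settings in run_exp.sh (illustrative values only)
num_gpus=8                                  # number of GPUs used for training
files_path=path/to/files/datacomp_small/    # where the filtering step wrote the selected uids
scale=small                                 # training scale
datacomp_scale=small                        # DataComp pool the uids were selected from
filter_list=("vas_clip_score_filter")       # placeholder name(s) of the filtering results to train on

# reshard the selected subset, train CLIP, and evaluate on the 38 downstream tasks
bash run_exp.sh
```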
The checkpoints and selected uids for DataComp-Small and DataComp-Medium are shared at this link.
We thank the authors of DataComp for open-sourcing their code.