Skip to content
/ VAS Public

Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning

Notifications You must be signed in to change notification settings

ypwang61/VAS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning

[ Paper ]

This is an official code of Variance Alignment score (VAS), which is a simple but efficient data selection method for CLIP model.

If you found this repository, our paper or data useful, please consider citing:

@article{wang2024variance,
  title={Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning},
  author={Wang, Yiping and Chen, Yifang and Yan, Wendan and Jamieson, Kevin and Du, Simon Shaolei},
  journal={arXiv preprint arXiv:2402.02055},
  year={2024}
}

Overview

Recently, data selection has emerged as a core issue for large-scale visual-language model pretraining like training CLIP, especially on noisy web-curated datasets. In this scenario, we design a simple but efficient data selection method named Variance Alignment score (VAS) filtering. It utilizes a very simple formula to evaluate the informativeness of data, which is an important but neglected factor of quality scores such as CLIP similarities. Details in our paper.

Illustration of our VAS on DataComp

Installing Dependencies and Downloading Dataset

Please follow DataComp to install the dependencies and DataComp dataset. Our code supports DataComp-Small (12.8M data in total, need 528G space) and DataComp-Medium (128M data in total, need 5.28T space) now.

Run VAS/VAS-D

We can run VAS/VAS-D + CLIP score filtering by execuating run_datacomp_small.sh for DataComp-Small dataset and run_datacomp_medium.sh for DataComp-Medium dataset. Here in these files, we recommend the path to DataComp-x dataset as path/to/datasets/datacomp_x/, and the path to files as path/to/files/datacomp_x/. Please first specific the dataset_path and files_path in these bash files before executing them. Besides, our codes are also compatible to DataComp and support running their baselines.

Run experiments

After running filtering algorithm, we can run run_exp.sh to realize (1) resharding training dataset, (2) training model, (3) evaluating on 38 downstream tasks. Please first specific the num_gpus, files_path, scale, datacomp_scale and filter_list in run_exp.sh as the examples given inside.

Checkpoints and UIDs

The checkpoints and uids for DataComp-Small and DataComp-Medium are shared in this link.

Acknowlegements

We thank the authors of DataComp for open sourcing their codes.

About

Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published