BAAI-DCAI/Dataset-Pruning
Dataset Pruning

In computer vision and multimodal learning, emerging large models (e.g., Vision Transformers, CLIP, EVA, SAM, and Emu) can handle a wide range of tasks and significantly outperform traditional neural networks when large-scale training data (e.g., ImageNet-21K, JFT-300M, or LAION-5B) is available.

However, storing large datasets and training on them is expensive and can even be unaffordable. It is known that large-scale datasets contain many redundant and easy samples that contribute little to model training.

Dataset pruning (or coreset selection) aims to remove those less-informative training samples and retain the informative ones from the original dataset, such that models trained on the retained subset can achieve comparable performance.
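To illustrate the idea, here is a minimal sketch of score-based pruning. It assumes each training sample has already been assigned an "informativeness" score (e.g., its loss under a reference model); the scoring function and the `keep_ratio` value are illustrative assumptions, not this repository's actual method.

```python
# Hypothetical sketch of score-based dataset pruning, NOT the
# repository's actual algorithm: keep the top `keep_ratio` fraction
# of samples ranked by a precomputed informativeness score.
import numpy as np


def prune_by_score(scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return indices of the `keep_ratio` most informative samples."""
    n_keep = max(1, int(round(len(scores) * keep_ratio)))
    # argsort is ascending, so the last n_keep indices are the
    # highest-scoring (most informative) samples.
    return np.argsort(scores)[-n_keep:]


# Toy example: 5 samples with made-up scores; keep 40% of them.
scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
kept = prune_by_score(scores, keep_ratio=0.4)
print(sorted(kept.tolist()))  # → [1, 3]
```

A model would then be trained only on `dataset[kept]`; the goal is that its accuracy stays close to training on the full dataset.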

This repository contains code written by BAAI-DCAI for pruning large-scale datasets, including ImageNet and LAION. Please open the corresponding folder (ImageNet or LAION) for more information.

We have released some coresets of ImageNet-1K/21K, and more are coming! If you urgently need the compressed ImageNet-1K/21K or LAION-2B, feel free to contact us at [email protected].
