In computer vision and multimodal learning, emerging large models such as vision transformers, CLIP, EVA, SAM, and Emu can tackle a wide range of tasks and significantly outperform traditional neural networks when large-scale training data, e.g., ImageNet-21K, JFT-300M, or LAION-5B, is available.
However, storing large datasets and training on them is expensive, and often unaffordable. It is well known that large-scale datasets contain many redundant and easy samples that contribute little to model training.
Dataset pruning (or coreset selection) aims to remove these less-informative training samples and retain the informative ones, such that models trained on the retained subset achieve performance comparable to training on the full dataset.
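As a concrete illustration (not necessarily this repository's method), a common form of coreset selection assigns each training sample an importance score and keeps only the top-ranked fraction. Below is a minimal sketch under that assumption; the names `select_coreset`, `scores`, and `keep_fraction` are hypothetical and not part of this codebase.

```python
import numpy as np

def select_coreset(scores: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return indices of samples to keep, ranked by importance score.

    scores: one importance score per training sample (higher = more
            informative), e.g., an EL2N or forgetting score.
    keep_fraction: fraction of the dataset to retain, e.g., 0.3.
    """
    n_keep = int(len(scores) * keep_fraction)
    # Sort descending by score and keep the most informative samples.
    return np.argsort(scores)[::-1][:n_keep]

# Example: prune a 1M-sample dataset down to 30% of its size.
rng = np.random.default_rng(0)
scores = rng.random(1_000_000)  # placeholder scores for illustration
coreset_idx = select_coreset(scores, keep_fraction=0.3)
print(len(coreset_idx))  # 300000
```

The pruned index set would then be used to subsample the original dataset before training.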
This repository contains code written by BAAI-DCAI for pruning large-scale datasets, including ImageNet and LAION. Please see the corresponding folders for more information: ImageNet and LAION.
We have released some coresets of ImageNet-1K/21K, and more are coming! If you urgently need the compressed ImageNet-1K/21K or LAION-2B, feel free to contact us: [email protected].