TaiSu

TaiSu (紫东太素): A 166M multimodal dataset for Chinese Vision-Language Pretraining

Dataset Construction:

Data collection
Text-based filtering
Image-text-retrieval-based filtering
Image-Captioning-based text augmentation

Dataset download

The TaiSu data is now available. The image URLs and their corresponding texts are stored in a CSV file. Baidu cloud link:

URLs&captions for TaiSu dataset: https://pan.baidu.com/s/1YITGlMF2L7EFLZrLuETJKQ?pwd=tais

Once the URLs are ready, images can be downloaded using download_tool/download.py.
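
For readers who want to see what the download step amounts to, here is a minimal sketch of a CSV-driven downloader. It is not the code in download_tool/download.py, whose interface is not described here. The column names `url` and `caption`, the presence of a header row, the `.jpg` extension, and the helper names `read_pairs`, `fetch`, and `download_all` are all assumptions made for illustration; adjust them to the actual CSV layout.

```python
import csv
import os
import urllib.request


def read_pairs(csv_file, url_col="url", caption_col="caption"):
    """Yield (index, url, caption) from a CSV file object with a header row.

    The column names are assumptions; change them to match the real file.
    """
    reader = csv.DictReader(csv_file)
    for idx, row in enumerate(reader):
        yield idx, row[url_col].strip(), row[caption_col].strip()


def fetch(url, timeout=10):
    """Return the raw bytes served at url."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(csv_file, out_dir, fetcher=fetch):
    """Save each image as <index>.jpg in out_dir; return the indices that failed.

    The .jpg extension is an assumption; the bytes are written unchanged.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for idx, url, _caption in read_pairs(csv_file):
        try:
            data = fetcher(url)
        except (OSError, ValueError):
            # Dead or malformed links are common in web-crawled data; skip them.
            failed.append(idx)
            continue
        with open(os.path.join(out_dir, f"{idx}.jpg"), "wb") as f:
            f.write(data)
    return failed
```

Passing the fetcher as a parameter keeps the loop easy to try offline with a stub before pointing it at the full URL list.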

Pre-extracted image embeddings

We provide the image embeddings extracted with CLIP's RN101 and ViT-B/32 variants.

Pre-extracted image features:

Pretrained models

Models trained on the web data of TaiSu and on the complete data of TaiSu are now available. Baidu cloud link: https://pan.baidu.com/s/1d3UKyQi7J4Qr1XE2j2V8og?pwd=0kjm

Usage:

from models.model_infer import build_lit

# Replace both placeholder paths with the downloaded state_dict files.
lit = build_lit(visual_model_path='path/to/visual/model/state_dict', txt_model_path='path/to/textual/model/state_dict')

API:
   lit.encode_image(imgs)
   lit.encode_text(txt)
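
A typical use of the two encoders is image-text matching by cosine similarity. The sketch below illustrates that step on plain lists of floats. The return type of `lit.encode_image` and `lit.encode_text` is not stated here, so converting their outputs to lists (for example with `.tolist()`) is an assumption, and the helper names `l2_normalize`, `similarity_matrix`, and `best_caption` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding.

    Returns a list of rows, one per image, with one column per text.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def best_caption(image_embs, text_embs):
    """For each image, return the index of the most similar text."""
    sims = similarity_matrix(image_embs, text_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


# Hypothetical usage, assuming the encoder outputs convert to nested lists:
#   image_embs = lit.encode_image(imgs).tolist()
#   text_embs = lit.encode_text(txt).tolist()
#   print(best_caption(image_embs, text_embs))
```

For large batches the same computation is usually done as a single matrix product in the tensor library the model runs on; the pure-Python version is only meant to make the arithmetic explicit.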

LICENCE

Unless specifically labeled otherwise, these Datasets are provided to You under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (“CC BY-NC-SA 4.0”), with the additional terms included herein. The CC BY-NC-SA 4.0 may be accessed at https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode. When You download or use the Datasets from the Website or elsewhere, You are agreeing to comply with the terms of CC BY-NC-SA 4.0, and also agreeing to the Dataset Terms. Where these Dataset Terms conflict with the terms of CC BY-NC-SA 4.0, these Dataset Terms shall prevail. We reiterate that this dataset may be used only for non-commercial purposes such as academic research, teaching, or scientific publications. We prohibit You from using the dataset or any derivative works for commercial purposes, such as selling data or using it for commercial gain.

Contact

Email:[email protected]

Organization: Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China

Citation

@inproceedings{
liu2022taisu,
title={TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training},
author={Yulong Liu and Guibo Zhu and Bin Zhu and Qi Song and Guojing Ge and Haoran Chen and GuanHui Qiao and Ru Peng and Lingxiang Wu and Jinqiao Wang},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2022},
url={https://openreview.net/forum?id=iAxH-ikIP0I}
}

