TaiSu

TaiSu(紫东太素)--A 166M multimodal dataset for Chinese Vision-Language Pretraining

  • Dataset Construction:
  1. Data collection
  2. Text-based filtering
  3. Image-text-retrieval-based filtering
  4. Image-captioning-based text augmentation

Dataset download

TaiSu data is now available. The image URLs and their corresponding texts are stored in a CSV file. Baidu cloud link:

Once the URLs are ready, images can be downloaded using download_tool/download.py.
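
The download step can be sketched roughly as follows. This is not the code in download_tool/download.py; it is a minimal standard-library illustration that assumes the CSV has a header row with columns named `url` and `caption` (both names are assumptions, so adjust them to the actual file). The helpers `read_pairs`, `image_path`, and `download_all` are hypothetical names introduced here for illustration.

```python
import csv
import hashlib
import http.client
import urllib.request
from pathlib import Path


def read_pairs(csv_file, url_column="url", text_column="caption"):
    """Yield (url, text) pairs from an open CSV file.

    The column names are assumptions; rows without a URL are skipped.
    """
    for row in csv.DictReader(csv_file):
        url = (row.get(url_column) or "").strip()
        if url:
            yield url, (row.get(text_column) or "").strip()


def image_path(url, out_dir):
    """Derive a stable file name from the URL so reruns can skip existing files."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return Path(out_dir) / (digest + ".jpg")


def download_all(csv_path, out_dir, timeout=10):
    """Download every image listed in the CSV and return the URLs that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for url, _text in read_pairs(f):
            target = image_path(url, out_dir)
            if target.exists():
                continue  # already downloaded in an earlier run
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    target.write_bytes(resp.read())
            except (OSError, ValueError, http.client.HTTPException):
                # Dead links and malformed URLs are common in web-scale data.
                failed.append(url)
    return failed
```

Web-collected URLs tend to expire over time, so collecting failures instead of stopping at the first error lets a long download run finish and be retried later.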

Pre-extracted image embeddings

We provide the image embeddings extracted with CLIP's RN101 and ViT-B/32 variants.

  • Pre-extracted image features:

Pretrained models

Models trained on the web data of TaiSu and on the complete TaiSu data are now available. Baidu cloud link: https://pan.baidu.com/s/1d3UKyQi7J4Qr1XE2j2V8og?pwd=0kjm

  • Usage:
from models.model_infer import build_lit

# Placeholder paths: point these at the downloaded state dicts.
lit = build_lit(visual_model_path="path/to/visual/model/state_dict",
                txt_model_path="path/to/textual/model/state_dict")

# API:
lit.encode_image(imgs)  # encode images
lit.encode_text(txt)    # encode texts
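
Once image and text features are available, a common way to use them is cosine-similarity matching between the two sets. The sketch below assumes that `encode_image` and `encode_text` each return one feature vector per input; the README does not document the return types, so this is an assumption. It is written with plain Python lists so it runs without the model, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(image_feats, text_feats):
    """Return cosine similarities, one row per image and one column per text."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def best_text_per_image(image_feats, text_feats):
    """For each image, return the index of the most similar text."""
    sims = similarity_matrix(image_feats, text_feats)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]
```

With real model outputs, the same computation would normally be done as a single matrix product on normalized feature tensors.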

LICENCE

Unless specifically labeled otherwise, these Datasets are provided to You under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (“CC BY-NC-SA 4.0”), with the additional terms included herein. The CC BY-NC-SA 4.0 may be accessed at https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode. When You download or use the Datasets from the Website or elsewhere, You are agreeing to comply with the terms of CC BY-NC-SA 4.0, and also agreeing to the Dataset Terms. Where these Dataset Terms conflict with the terms of CC BY-NC-SA 4.0, these Dataset Terms shall prevail. We reiterate that this dataset may be used only for non-commercial purposes such as academic research, teaching, or scientific publications. We prohibit You from using the dataset or any derivative works for commercial purposes, such as selling data or using it for commercial gain.

Contact

Email:[email protected]

Organization: Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China

Citation

@inproceedings{
liu2022taisu,
title={TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training},
author={Yulong Liu and Guibo Zhu and Bin Zhu and Qi Song and Guojing Ge and Haoran Chen and GuanHui Qiao and Ru Peng and Lingxiang Wu and Jinqiao Wang},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2022},
url={https://openreview.net/forum?id=iAxH-ikIP0I}
}
