Datasets

We will update this file as soon as possible!

We provide links to download our preprocessed dataset. If you would like to process the data on your own, we will soon provide scripts for you to do so.

Note that you should replace the image/video/speech file paths in the JSON files according to your storage path.

Likewise, replace the original paths in xllm/configs/datasets/*/*.yaml or xllm/projects/train/*.yaml with your own file paths.

Image Interface

The pretraining datasets used in X-LLM are all publicly available. Here we provide the public links to these data. We recommend that you first download the images from the links, and then link the image paths with the downloaded dataset JSON (Chinese) that we provide.

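| Dataset | Image | Data | Language |
| --- | --- | --- | --- |
| CC3M | Image Url | Data Json | ZH |
| MSCOCO | Image Url | Data Json | ZH |
| Visual Genome | Image Url | Data Json | ZH |
| Flickr30k | Image Url | Data Json | ZH |
| SBU | Image Url | Data Json | ZH |
| AI Challenger captions | Image Url | Data Json | ZH |
| Wukong captions | Image Url | Data Json | ZH |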

Please note that for the Wukong dataset, we took the first 50 million images, scored them with Chinese-CLIP (ViT-B-16 model), and kept only samples with a visual-textual similarity score greater than 0.475. Additionally, you will need to pair the captions with the corresponding images yourself, using the captions as the matching key.
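The thresholding step described above can be sketched as follows. This is only an illustration, not the script we used: it assumes the similarity scores have already been computed with Chinese-CLIP and stored on each sample under a hypothetical `clip_score` key, and the function name and the `max_candidates` parameter are likewise made up for this example.

```python
from itertools import islice

# Threshold quoted in this README for the Wukong captions.
SIMILARITY_THRESHOLD = 0.475


def filter_by_similarity(samples, threshold=SIMILARITY_THRESHOLD,
                         score_key="clip_score", max_candidates=None):
    """Keep samples whose precomputed similarity is strictly above threshold.

    samples        -- iterable of dicts; each holds a score under score_key
                      (a hypothetical field name, not part of the released JSON)
    max_candidates -- only look at the first N samples (None means all of them)
    """
    candidates = islice(samples, max_candidates)
    return [s for s in candidates if s[score_key] > threshold]
```

With `max_candidates=50_000_000` this mirrors the "first 50 million images" restriction; a sample scoring exactly 0.475 is dropped because the comparison is strict.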

Data Format

[
    {
        "image": "train2014/COCO_train2014_000000013356.jpg",
        "caption": [
            "一个站在玻璃附近的白衣男子",
            "一个人在破旧的浴室里穿着防护服和面具",
            "一个人从头到脚穿着白色涂在房间里",
            "浴室正在装修,一个人在墙上画画",
            "一个穿着防护服的人在房间里工作"
        ],
        "image_id": "train2014/COCO_train2014_000000013356.jpg",
        "dataset": "coco_zh"
    }
]

or

[
    {
        "image": "/raid/cfl/en_pretraining/data/images/sbu/pythonDownload/subpic/5eda85e140.jpg",
        "caption": "谢菲尔德公园花园苏塞克斯湖边的老树",
        "image_id": "5eda85e140.jpg",
        "dataset": "sbu_zh"
    }
]

We do not use the "image_id" field, which is the same as "image" in most cases. Note that you should replace the image paths in the JSON files according to your storage path.
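As a rough sketch of how the path replacement could be done, the snippet below rewrites the "image" field of every record to point under a local directory and also turns a single-string "caption" into a one-element list, so both formats shown above can be handled the same way. The function names and the `image_root` parameter are hypothetical and not part of the X-LLM code; the sketch also assumes each file parses as standard JSON.

```python
import json
import os
import posixpath


def normalize_entry(entry, image_root):
    """Return a copy of one record with a local image path and a caption list."""
    stored = entry["image"]
    # Absolute paths come from the original authors' machine, so keep only
    # the file name; relative paths such as "train2014/xxx.jpg" are kept whole.
    if posixpath.isabs(stored):
        stored = posixpath.basename(stored)
    captions = entry["caption"]
    if isinstance(captions, str):
        captions = [captions]
    fixed = dict(entry)
    fixed["image"] = os.path.join(image_root, stored)
    fixed["caption"] = list(captions)
    return fixed


def relink_file(src_json, dst_json, image_root):
    """Rewrite every record in src_json and save the result to dst_json."""
    with open(src_json, encoding="utf-8") as f:
        records = json.load(f)
    fixed = [normalize_entry(r, image_root) for r in records]
    with open(dst_json, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Chinese captions human-readable.
        json.dump(fixed, f, ensure_ascii=False, indent=4)
    return len(fixed)
```

For example, `relink_file("coco_zh.json", "coco_zh_local.json", "/data/coco")` would write a copy whose image paths start with `/data/coco`; the file names and directory here are placeholders.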

Speech Interface

We provide the public links to the speech data (*.wav files and features). We recommend that you first download the data from the links, and then link the speech data paths with the downloaded dataset JSON that we provide.

| Dataset | Audio/Features | Data | Language |
| --- | --- | --- | --- |
| AISHELL-2 | Audio/Features | Data Json | ZH |
| VSDial-CN | Audio/Features | Data Json | ZH |


Video Interface

The pretraining datasets used in X-LLM are all publicly available. Here we provide the public links to these data. We recommend that you first download the videos from the links, and then link the video paths with the downloaded dataset JSON (Chinese) that we provide.

| Dataset | Video | Data |
| --- | --- | --- |
| MSRVTT | Video Url | Data Json |
| ActivityNet | Video Url | Data Json |


Evaluation

We provide a Chinese version of the LLaVA test set, an evaluation dataset built from 30 unseen images. Each image is associated with three types of instructions: conversation, detailed description, and complex reasoning.