The Grounding-anything Dataset (GranD) offers densely annotated data, produced by an automated annotation pipeline that leverages state-of-the-art (SOTA) vision and vision-language (V-L) models. This documentation covers how to download the GranD dataset and provides a guide to the automated annotation pipeline used to create it.
- Annotations: MBZUAI/GranD
- Images: GranD utilizes images from the SAM dataset; see Download.
Note: Annotations are being uploaded incrementally; more parts will be available soon.
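For example, the annotations can be fetched with the `huggingface-cli` tool (this assumes the `huggingface_hub` package is installed; the local directory name is arbitrary):

```shell
# Fetch the GranD annotations from the Hugging Face Hub.
# Requires: pip install huggingface_hub
huggingface-cli download MBZUAI/GranD --repo-type dataset --local-dir GranD_annotations
```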
After downloading the GranD annotations, utilize the scripts below to transform them into GLaMM pretraining data, or to prepare them for your specific tasks.
- For object-level tasks like object detection, semantic segmentation: prepare_object_lvl_data.py
- For image-level captioning and caption grounding: prepare_grand_caption_grounding.py
- For referring expression generation and referring expression segmentation: prepare_grand_referring_expression
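As a sketch of the kind of transformation these prep scripts perform, the snippet below flattens one image's object-level annotation into per-object records. The field names (`objects`, `bbox`, `label`, `attributes`) are illustrative assumptions, not the official GranD schema; adapt them to the actual annotation files.

```python
# Illustrative only: the annotation schema below is an assumption,
# not the official GranD format.
def to_detection_records(grand_annotation: dict) -> list:
    """Flatten one image's object-level annotation into per-object records."""
    records = []
    for obj in grand_annotation.get("objects", []):
        records.append({
            "image_id": grand_annotation["image_id"],
            "label": obj["label"],
            "bbox": obj["bbox"],  # assumed [x, y, w, h]
            "attributes": obj.get("attributes", []),
        })
    return records

# Toy example: one image with a single annotated object.
ann = {"image_id": "sa_1.jpg",
       "objects": [{"label": "dog", "bbox": [10, 20, 50, 40]}]}
records = to_detection_records(ann)  # one record per object
```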
The above scripts generate annotations in JSON format. To convert these for use in pretraining datasets requiring LMDB format, use the following scripts:
- To convert to lmdb: get_txt_for_lmdb.py
- To extract file names in txt format: get_txt_for_lmdb.py
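A minimal sketch of the file-name extraction step, assuming the annotations sit as one JSON file per image in a flat directory (the real script may organize things differently):

```python
from pathlib import Path

# Sketch: list every JSON annotation file under a directory and write
# one file name per line. The flat-directory layout is an assumption.
def write_file_list(ann_dir: str, out_txt: str) -> int:
    names = sorted(p.name for p in Path(ann_dir).glob("*.json"))
    Path(out_txt).write_text("\n".join(names) + "\n")
    return len(names)
```

The resulting txt file can then serve as the key list when building the LMDB.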
GranD is a comprehensive, multi-purpose image-text dataset offering a range of contextual information, from fine-grained object details to high-level scene context. The annotation pipeline comprises four distinct levels; the code for all four is provided in: GranD
More detailed information:
- To run the entire pipeline: run_pipeline.sh
- To set up the environments detailed in run_pipeline.sh, refer to: environments
- Level-1: Object Localization and Attributes
- Landmark Categorization: landmark
- Depth Map Estimation: Midas Depth Estimation
- Image Tagging: RAM Tag2Text Tagging
- Standard Object Detection: CO-DETR OD, EVA OD
- Open Vocabulary Object Detection: OWL-ViT OVD, POMP OVD
  - Attribute Detection and Grounding: Attribute & Grounding GRiT
- Open Vocabulary Classification: OV Classification OV-SAM
- Combine the predictions: Merging
- Generate Level-1 Scene Graph: Level-1 Scene Graph
- Level-2: Relationships
- Captioning: BLIP-2 Captioning, LLaVA Captioning
- Grounding Short Captions: MDETR Grounding
- Combine the predictions: Merging
- Generate Level-2 Scene Graph and Update Level-1: Level-2 Scene Graph
- Enrich Attributes: GPT4-RoI Attributes
- Label Assignment: EVA-CLIP Label Assignment
- Level-3: Scene Graph and Dense Captioning
- Generate Dense Captions: Scene graph dense captioning LLaVA
- Level-4: Extra Contextual Insight
- Generate Level-4 Additional Context: Extra Context
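Levels 1 and 2 above each include a "Combine the predictions" merging step. As a simplified illustration only (not the actual GranD merging logic, which is considerably more involved), a greedy IoU-based deduplication across detectors looks like:

```python
# Simplified sketch of combining boxes from several detectors by
# suppressing overlapping duplicates (greedy NMS). Not the actual
# GranD Merging implementation.
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_detections(dets, iou_thr=0.5):
    """Keep the highest-scoring box per overlapping cluster."""
    kept = []
    for d in sorted(dets, key=lambda d: -d["score"]):
        if all(iou(d["box"], k["box"]) < iou_thr for k in kept):
            kept.append(d)
    return kept
```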