This is the official repository for the paper "Mono3DVG: 3D Visual Grounding in Monocular Images". [AAAI paper] [ArXiv paper] [AAAI Video/Poster]
The paper has been accepted by AAAI 2024 🎉.
School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University
- Dec-09-2023: Mono3DVG paper is accepted by AAAI 2024. 🔥🔥
- Dec-29-2023: Mono3DRefer dataset is released. 🔥🔥
- Mar-13-2024: Mono3DVG-TR codebase and checkpoint are released. 🔥🔥
- 📦 Components for result visualization of Mono3DVG are coming soon! 🚀
We introduce Mono3DVG, a novel task of 3D visual grounding in monocular RGB images using language descriptions with appearance and geometry information. Mono3DVG aims to localize the true 3D extent of the referred object in an image from such a description.
Download our Mono3DRefer dataset. We build the first dataset for Mono3DVG, termed Mono3DRefer, which can be downloaded from our Google Drive:
https://drive.google.com/drive/folders/1ICBv0SRbRIUnl_z8DVuH8lz7KQt580EI?usp=drive_link
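If you prefer the command line, the folder can also be fetched with gdown (a sketch, not part of the official instructions; any way of downloading from Google Drive works):

```bash
# Download the Mono3DRefer folder from Google Drive (requires: pip install gdown)
gdown --folder "https://drive.google.com/drive/folders/1ICBv0SRbRIUnl_z8DVuH8lz7KQt580EI"
```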
Mono3DVG-TR is the first end-to-end transformer-based network for monocular 3D visual grounding.
You can follow the environment setup of MonoDETR.
1.2 Install PyTorch and torchvision matching your CUDA version: torch >= 1.9.0; our version is torch == 1.13.1.
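For example, with CUDA 11.6 the matching wheels can be installed as below (a sketch; the `+cu116` suffix and the torchvision version are assumptions, adjust them to your CUDA toolkit):

```bash
# Install torch 1.13.1 / torchvision 0.14.1 built against CUDA 11.6
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
```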
1.3 Install requirements and compile the deformable attention ops:

```bash
pip install -r requirements.txt
cd lib/models/mono3dvg/ops/
bash make.sh
cd ../../../..
```
1.4 Download the Mono3DRefer dataset and prepare the directory structure as:
```
│Mono3DVG/
├──Mono3DRefer/
│   ├──images/
│   │   ├──000000.png
│   │   ├──...
│   ├──calib/
│   │   ├──000000.txt
│   │   ├──...
│   ├──Mono3DRefer_train_image.txt
│   ├──Mono3DRefer_val_image.txt
│   ├──Mono3DRefer_test_image.txt
│   ├──Mono3DRefer.json
│   ├──test_instanceID_split.json
├──configs/
│   ├──mono3dvg.yaml
│   ├──checkpoint_best_MonoDETR.pth
├──lib/
│   ├──datasets/
│   │   ├──...
│   ├──helpers/
│   │   ├──...
│   ├──losses/
│   │   ├──...
│   ├──models/
│   │   ├──...
├──roberta-base/
│   ├──...
├──utils/
│   ├──...
├──outputs/            # save_path
│   ├──mono3dvg/
│   │   ├──...
├──test.py
├──train.py
```
You can also change the dataset path via `root_dir` and the save path via `save_path` in `configs/mono3dvg.yaml`.
You must download the pre-trained RoBERTa and MonoDETR models. You can also download the checkpoint we provide to evaluate the Mono3DVG-TR model.
| Models | Links | File Path | File Name |
| --- | --- | --- | --- |
| RoBERTa | model | `roberta-base/` | `pytorch_model.bin` |
| Pre-trained model (MonoDETR) | model | `configs/` | `checkpoint_best_MonoDETR.pth` |
| Best checkpoint (Mono3DVG-TR) | model | `outputs/mono3dvg/` | `checkpoint_best.pth` |
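The MonoDETR and Mono3DVG-TR checkpoints come from the links in the table above. For the RoBERTa weights, one option is to pull the `roberta-base` repository from the Hugging Face Hub (a sketch that assumes git-lfs is installed; any way of placing `pytorch_model.bin` into `roberta-base/` works):

```bash
# Clone the roberta-base model repository so that pytorch_model.bin ends up in ./roberta-base/
git lfs install
git clone https://huggingface.co/roberta-base
```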
You can modify the settings of GPU, models, and training in `configs/mono3dvg.yaml`:

```bash
CUDA_VISIBLE_DEVICES=1 python train.py
```
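If you want to keep a training log on disk, the command can be wrapped as below (a usage sketch; the log file name is arbitrary):

```bash
# Redirect stdout/stderr to a log file while still printing to the console
CUDA_VISIBLE_DEVICES=1 python train.py 2>&1 | tee train.log
```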
The best checkpoint will be evaluated by default. You can change it via `pretrain_model: 'checkpoint_best.pth'` in `configs/mono3dvg.yaml`:

```bash
CUDA_VISIBLE_DEVICES=1 python test.py
```
```bibtex
@inproceedings{zhan2024mono3dvg,
  title={Mono3DVG: 3D Visual Grounding in Monocular Images},
  author={Zhan, Yang and Yuan, Yuan and Xiong, Zhitong},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6988--6996},
  year={2024}
}
```
Our code is based on MonoDETR (ICCV 2023). We sincerely appreciate their contributions and thank the authors for releasing the source code. I would like to thank Xiong Zhitong and Yuan Yuan for helping with the manuscript. I also thank the School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University, for supporting this work.
If you have any questions about this project, please feel free to contact [email protected].