This is the official repository for the paper "Mono3DVG: 3D Visual Grounding in Monocular Images". [AAAI paper] [ArXiv paper] [AAAI Video/Poster]
The paper has been accepted by AAAI 2024 🎉.
School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University
- Dec-09-2023: Mono3DVG paper is accepted by AAAI 2024. 🔥🔥
- Dec-29-2023: Mono3DRefer dataset is released. 🔥🔥
- Mar-13-2024: Mono3DVG-TR codebase and checkpoint are released. 🔥🔥
- 📦 Components for result visualization of Mono3DVG are coming soon! 🚀
We introduce Mono3DVG, a novel task of 3D visual grounding in monocular RGB images using language descriptions with appearance and geometry information. Mono3DVG aims to localize the true 3D extent of the referred object in an image from such a description.
Download our Mono3DRefer dataset. We build the first dataset for Mono3DVG, termed Mono3DRefer, which can be downloaded from our Google Drive:
https://drive.google.com/drive/folders/1ICBv0SRbRIUnl_z8DVuH8lz7KQt580EI?usp=drive_link
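If you prefer the command line, the folder can also be fetched with gdown (a sketch, not part of the official instructions; any way of downloading from Google Drive works):

```bash
# Download the Mono3DRefer folder from Google Drive (requires: pip install gdown)
gdown --folder "https://drive.google.com/drive/folders/1ICBv0SRbRIUnl_z8DVuH8lz7KQt580EI"
```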
Mono3DVG-TR is the first end-to-end transformer-based network for monocular 3D visual grounding.
You can follow the environment setup of MonoDETR.
1.2 Install PyTorch and torchvision matching your CUDA version: torch >= 1.9.0; our version is torch == 1.13.1.
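For example, with CUDA 11.6 the matching wheels can be installed as below (a sketch; the `+cu116` suffix and the torchvision version are assumptions, adjust them to your CUDA toolkit):

```bash
# Install torch 1.13.1 / torchvision 0.14.1 built against CUDA 11.6
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
```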
1.3 Install requirements and compile the deformable attention ops:

```bash
pip install -r requirements.txt
cd lib/models/mono3dvg/ops/
bash make.sh
cd ../../../..
```
1.4 Download the Mono3DRefer dataset and prepare the directory structure as:
```
│Mono3DVG/
├──Mono3DRefer/
│   ├──images/
│   │   ├──000000.png
│   │   ├──...
│   ├──calib/
│   │   ├──000000.txt
│   │   ├──...
│   ├──Mono3DRefer_train_image.txt
│   ├──Mono3DRefer_val_image.txt
│   ├──Mono3DRefer_test_image.txt
│   ├──Mono3DRefer.json
│   ├──test_instanceID_split.json
├──configs/
│   ├──mono3dvg.yaml
│   ├──checkpoint_best_MonoDETR.pth
├──lib/
│   ├──datasets/
│   │   ├──...
│   ├──helpers/
│   │   ├──...
│   ├──losses/
│   │   ├──...
│   ├──models/
│   │   ├──...
├──roberta-base/
│   ├──...
├──utils/
│   ├──...
├──outputs/            # save_path
│   ├──mono3dvg/
│   │   ├──...
├──test.py
├──train.py
```
You can also change the dataset path via `root_dir` and the save path via `save_path` in `configs/mono3dvg.yaml`.
You must download the pre-trained RoBERTa and MonoDETR models. You can also download the checkpoint we provide to evaluate the Mono3DVG-TR model.
| Models | Links | File Path | File Name |
| --- | --- | --- | --- |
| RoBERTa | model | `roberta-base/` | `pytorch_model.bin` |
| Pre-trained model (MonoDETR) | model | `configs/` | `checkpoint_best_MonoDETR.pth` |
| Best checkpoint (Mono3DVG-TR) | model | `outputs/mono3dvg/` | `checkpoint_best.pth` |
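The MonoDETR and Mono3DVG-TR checkpoints come from the links in the table above. For the RoBERTa weights, one option is to pull the `roberta-base` repository from the Hugging Face Hub (a sketch that assumes git-lfs is installed; any way of placing `pytorch_model.bin` into `roberta-base/` works):

```bash
# Clone the roberta-base model repository so that pytorch_model.bin ends up in ./roberta-base/
git lfs install
git clone https://huggingface.co/roberta-base
```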
You can modify the settings of GPU, models, and training in `configs/mono3dvg.yaml`:

```bash
CUDA_VISIBLE_DEVICES=1 python train.py
```
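If you want to keep a training log on disk, the command can be wrapped as below (a usage sketch; the log file name is arbitrary):

```bash
# Redirect stdout/stderr to a log file while still printing to the console
CUDA_VISIBLE_DEVICES=1 python train.py 2>&1 | tee train.log
```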
The best checkpoint will be evaluated by default. You can change it via `pretrain_model: 'checkpoint_best.pth'` in `configs/mono3dvg.yaml`:

```bash
CUDA_VISIBLE_DEVICES=1 python test.py
```
```bibtex
@inproceedings{zhan2024mono3dvg,
  title={Mono3DVG: 3D Visual Grounding in Monocular Images},
  author={Zhan, Yang and Yuan, Yuan and Xiong, Zhitong},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6988--6996},
  year={2024}
}
```
Our code is based on MonoDETR (ICCV 2023). We sincerely appreciate their contributions and thank the authors for releasing the source code. I would like to thank Xiong Zhitong and Yuan Yuan for helping with the manuscript. I also thank the School of Artificial Intelligence, OPtics, and ElectroNics (iOPEN), Northwestern Polytechnical University, for supporting this work.
If you have any questions about this project, please feel free to contact [email protected].