
VISITRON: A multi-modal Transformer-based model for Cooperative Vision-and-Dialog Navigation (CVDN)




VISITRON: Visual Semantics-aligned Interactively Trained Object-Navigator


Ayush Shrivastava, Karthik Gopalakrishnan, Yang Liu, Robinson Piramuthu, Gokhan Tür, Devi Parikh, Dilek Hakkani-Tür

Accepted to:

  1. Findings of ACL 2022
  2. NAACL 2021, Visually Grounded Interaction and Language (ViGIL) Workshop



Clone the repo using:

git clone --recursive

Matterport3D Dataset and Simulator

This codebase uses the Matterport3D Simulator. Detailed instructions on how to set up the simulator and how to preprocess the Matterport3D data for faster simulator performance are available here: Matterport3DSimulator_README. We provide a Docker setup to make installing the simulator easier.

We assume that the Matterport3D dataset is present at $MATTERPORT_DATA_DIR, which can be set using:


Docker Setup

Build the Docker image:

docker build -t mattersim:visitron .

To run the Docker container and mount the codebase and the Matterport3D dataset, use:

nvidia-docker run -it --ipc=host --cpuset-cpus="$(taskset -c -p $$ | cut -f2 -d ':' | awk '{$1=$1};1')" --volume `pwd`:/root/mount/Matterport3DSimulator --mount type=bind,source=$MATTERPORT_DATA_DIR,target=/root/mount/Matterport3DSimulator/data/v1/scans,readonly mattersim:visitron
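
The `--cpuset-cpus` value in the command above comes from a small shell pipeline: `taskset -c -p $$` prints the CPU affinity list of the current shell, `cut -f2 -d ':'` keeps the part after the colon, and `awk '{$1=$1};1'` trims the surrounding whitespace. The container is therefore restricted to the CPUs the current shell is allowed to use. The Python sketch below mirrors only the text processing, to show what string ends up being passed; the function name and the sample line are illustrative, not part of this repo.

```python
def cpuset_from_taskset(output: str) -> str:
    """Mimic `cut -f2 -d ':' | awk '{$1=$1};1'` on one line of taskset output."""
    # cut -f2 -d ':' keeps the second colon-separated field.
    field = output.split(":")[1]
    # awk '{$1=$1};1' rebuilds the line from its whitespace-separated fields,
    # which removes leading and trailing blanks.
    return " ".join(field.split())


# Sample line in the format printed by `taskset -c -p <pid>` (pid is made up).
print(cpuset_from_taskset("pid 4242's current affinity list: 0-7"))
```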

Task Data Setup

NDH and R2R

mkdir -p srv/task_data
bash scripts/


Refer to the RxR repo for setup instructions, then copy the data to the srv/task_data/RxR/data folder.


Inside the docker container, run these commands to generate pre-training data.

For NDH, run

python scripts/ --dataset_to_use NDH --split train

For R2R, run

python scripts/ --dataset_to_use R2R --split train

The data is saved to srv/task_data/pretrain_data. By default, this script starts 8 multiprocessing threads to speed up execution. Adjust --start_job_index, --end_job_index and --global_total_jobs to change how many threads run.
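
One common way to split work across indexed jobs is to give each job a contiguous, near-equal slice of the input. The sketch below illustrates that idea only; it assumes that `--global_total_jobs` is the total number of slices and that a run handles the slices from `--start_job_index` to `--end_job_index` (treated as inclusive here). The actual script may partition the data differently, and the function names are hypothetical.

```python
def shard_bounds(num_items: int, job_index: int, total_jobs: int) -> tuple:
    """Return the [start, end) range of items handled by one job."""
    base, extra = divmod(num_items, total_jobs)
    # The first `extra` jobs each take one additional item.
    start = job_index * base + min(job_index, extra)
    end = start + base + (1 if job_index < extra else 0)
    return start, end


def items_for_run(items, start_job_index, end_job_index, global_total_jobs):
    """Collect the items for jobs start_job_index..end_job_index (inclusive, assumed)."""
    selected = []
    for job in range(start_job_index, end_job_index + 1):
        start, end = shard_bounds(len(items), job, global_total_jobs)
        selected.extend(items[start:end])
    return selected


# Example: 10 items split into 8 jobs, this machine runs jobs 0 to 3.
print(items_for_run(list(range(10)), 0, 3, 8))
```

Under these assumptions, running jobs 0 to 3 on one machine and 4 to 7 on another would cover every item exactly once.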

Image Features

Our pre-training approach requires object-level features from Faster R-CNN concatenated with orientation features.

First, follow the setup from the bottom-up attention repo inside a Docker container to install Caffe. Note that the bottom-up attention code requires Python 2.

Then, extract object-level features using

python2 scripts/

You can use --gpu_id to parallelize the feature extraction process over multiple GPUs.

Then, to concatenate orientation features to object-level features, use

python scripts/
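
As a rough illustration of what "concatenating orientation features" can mean, the sketch below appends a small orientation vector to every object-level feature row of a view. The encoding used here, `[sin(heading), cos(heading), sin(elevation), cos(elevation)]`, is an assumption borrowed from common vision-and-language navigation practice, and the feature sizes and function names are hypothetical. The script in this repo defines the actual layout.

```python
import math


def orientation_features(heading: float, elevation: float) -> list:
    """Encode a viewing direction (radians) as four values (assumed encoding)."""
    return [math.sin(heading), math.cos(heading),
            math.sin(elevation), math.cos(elevation)]


def concat_orientation(object_features, heading, elevation):
    """Append the same orientation vector to each object's feature row."""
    orient = orientation_features(heading, elevation)
    return [list(row) + orient for row in object_features]


# Two detected objects with toy 5-dimensional features, seen at heading 0, elevation 0.
features = [[0.1] * 5, [0.2] * 5]
combined = concat_orientation(features, heading=0.0, elevation=0.0)
print(len(combined), len(combined[0]))
```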

During fine-tuning, we use scene-level ResNet features. Download the ResNet features from this link, or extract them yourself using

python scripts/

VISITRON Initialization

Before performing navigation-specific pre-training and fine-tuning, we initialize VISITRON with disembodied weights from the Oscar model. Download the Oscar pre-trained weights using

unzip $ -d srv/oscar_weights/

where $MODEL_NAME is either base-vg-labels or base-no-labels.


We provide pre-training, training and evaluation scripts in run_scripts/.

As an example, use the following command to run a script.

bash run_scripts/viewpoint_train/ $MODE

where $MODE is one of [cpu, single-gpu 0, multi-gpu-dp, multi-gpu-ddp].

  • Use cpu to train on CPU.
  • Use single-gpu 0 to train on a single GPU. To use a different GPU, change 0 to that GPU's index.
  • Use multi-gpu-dp to train on all available GPUs using DataParallel.
  • Use multi-gpu-ddp to train on 4 GPUs using DistributedDataParallel. Change --nproc_per_node in the script to set the number of GPUs used in DistributedDataParallel mode.
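
The four modes listed above can be summarized as a small lookup. The sketch below is only a reading aid written in Python; the run scripts themselves are bash, and the function name and the returned dictionary layout are hypothetical.

```python
def describe_mode(mode_args: list) -> dict:
    """Map the $MODE words passed to a run script to the setup they select."""
    mode = mode_args[0]
    if mode == "cpu":
        return {"backend": "cpu", "gpus": []}
    if mode == "single-gpu":
        # The GPU index follows the mode name, e.g. `single-gpu 0`.
        return {"backend": "single-gpu", "gpus": [int(mode_args[1])]}
    if mode == "multi-gpu-dp":
        return {"backend": "DataParallel", "gpus": "all"}
    if mode == "multi-gpu-ddp":
        # 4 is the default stated above; --nproc_per_node in the script changes it.
        return {"backend": "DistributedDataParallel", "nproc_per_node": 4}
    raise ValueError(f"unknown mode: {mode}")


print(describe_mode(["single-gpu", "1"]))
```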

The scripts in run_scripts/ are organized as follows: pretraining scripts are in pretrain; training scripts that use viewpoint selection as the action space are in viewpoint_train; scripts for the turn-based action space are in turn_based_train; scripts for training and evaluating the question-asking classifier are in classifier; and ablations contains the training scripts for Table 1 of the paper.

To pretrain our model on NDH and R2R and then finetune on NDH and RxR with the viewpoint selection action space, run:

  1. Pretrain on NDH+R2R for all objectives:
bash run_scripts/pretrain/ multi-gpu-ddp
  2. Pick the best pretrained checkpoint by evaluating using
bash run_scripts/pretrain/ multi-gpu-ddp

Change --model_name_or_path in run_scripts/viewpoint_train/ to load the best pretrained checkpoint.

  3. Finetune on NDH + RxR using
bash run_scripts/viewpoint_train/ multi-gpu-ddp
  4. Evaluate trained models using
bash run_scripts/viewpoint_train/ multi-gpu-ddp
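
Picking the best pretrained checkpoint amounts to comparing one evaluation metric across the saved checkpoints. A minimal sketch, assuming the scores have already been collected into a dictionary; the checkpoint names and values below are made up, and whether higher or lower is better depends on the metric reported by the evaluation script.

```python
def best_checkpoint(scores: dict, higher_is_better: bool = True) -> str:
    """Return the checkpoint name with the best score."""
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)


# Hypothetical evaluation results keyed by checkpoint directory name.
scores = {"checkpoint-1000": 0.41, "checkpoint-2000": 0.47, "checkpoint-3000": 0.45}
print(best_checkpoint(scores))
```

The selected name is what --model_name_or_path would then point to.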

You can then run the scripts in run_scripts/classifier to train the question-asking classifier.

For any run script, make sure these arguments refer to the correct paths: img_feat_dir, img_feature_file, data_dir, model_name_or_path, output_dir.
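
A quick pre-flight check can catch a wrong path before a long run starts. The sketch below is a hypothetical helper, not part of this repo: it takes the five arguments named above as a dictionary and reports the ones that are unset or point to a missing location. It assumes output_dir may not exist yet and only requires it to be set.

```python
import os

PATH_ARGS = ("img_feat_dir", "img_feature_file", "data_dir",
             "model_name_or_path", "output_dir")


def missing_paths(args: dict) -> list:
    """Return the names of path arguments that are unset or do not exist."""
    missing = []
    for name in PATH_ARGS:
        value = args.get(name)
        if not value:
            missing.append(name)
        elif name != "output_dir" and not os.path.exists(value):
            # output_dir is assumed to be created by the run itself.
            missing.append(name)
    return missing


# Example with placeholder values; replace them with the paths from your run script.
print(missing_paths({"data_dir": ".", "output_dir": "srv/results"}))
```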


This library is licensed under the MIT-0 License. See the LICENSE file.


  title={VISITRON: Visual Semantics-Aligned Interactively Trained Object-Navigator},
  author={Shrivastava, Ayush and Gopalakrishnan, Karthik and Liu, Yang and Piramuthu, Robinson and T{\"u}r, G{\"o}khan and Parikh, Devi and Hakkani-T{\"u}r, Dilek},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2022},


