This repository contains the code for the following paper:
- D. Fried*, R. Hu*, V. Cirik*, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein**, T. Darrell**, Speaker-Follower Models for Vision-and-Language Navigation. In NIPS, 2018.
@inproceedings{fried2018speaker,
title={Speaker-Follower Models for Vision-and-Language Navigation},
author={Fried, Daniel and Hu, Ronghang and Cirik, Volkan and Rohrbach, Anna and Andreas, Jacob and Morency, Louis-Philippe and Berg-Kirkpatrick, Taylor and Saenko, Kate and Klein, Dan and Darrell, Trevor},
booktitle={Advances in Neural Information Processing Systems (NIPS)},
year={2018}
}
(*, **: indicate equal contribution)
Project Page: https://ronghanghu.com/speaker_follower
- Install Python 3 (Anaconda recommended: https://www.continuum.io/downloads).
- Install PyTorch following the instructions on https://pytorch.org/ (we used PyTorch 0.3.1 in our experiments).
- Download this repository or clone recursively with Git, and then enter the root directory of the repository:
# Make sure to clone with --recursive
git clone --recursive https://github.com/ronghanghu/speaker_follower.git
cd speaker_follower
If you didn't clone with the --recursive flag, you'll need to manually clone the pybind submodule from the top-level directory:
git submodule update --init --recursive
- Install the dependencies for the Matterport3D Simulator:
sudo apt-get install libopencv-dev python-opencv freeglut3 freeglut3-dev libglm-dev libjsoncpp-dev doxygen libosmesa6-dev libosmesa6 libglew-dev
- Compile the Matterport3D Simulator:
mkdir build && cd build
cmake ..
make
cd ../
Note: This repository is built upon the Matterport3DSimulator codebase. Additional details on the Matterport3D Simulator can be found in README_Matterport3DSimulator.md. A quick import check of the compiled simulator bindings is sketched below.
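To verify that the simulator compiled correctly, a short check like the following can help. It assumes the Python bindings are built as a MatterSim module inside build/, as in the Matterport3DSimulator codebase; adjust the path if your build layout differs.

```python
# Sanity check for the compiled simulator bindings (run from the repository root).
# Assumes the bindings are built as a `MatterSim` module inside build/; adjust if needed.
import sys

sys.path.append('build')
import MatterSim  # this import fails if the simulator did not build correctly

print('Loaded MatterSim from', MatterSim.__file__)
```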
- Download the precomputed ResNet image features, and extract them into img_features/:
mkdir -p img_features/
cd img_features/
wget https://storage.googleapis.com/bringmeaspoon/img_features/ResNet-152-imagenet.zip -O ResNet-152-imagenet.zip
unzip ResNet-152-imagenet.zip
cd ..
After this step, img_features/ should contain ResNet-152-imagenet.tsv. (Note that you only need to download the features extracted from ImageNet-pretrained ResNet to run the following experiments. Places-pretrained ResNet features or actual images are not required.) A quick way to sanity-check the extracted file is sketched below.
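For reference, here is a minimal sketch for reading one row of the feature file. It assumes the TSV follows the usual R2R precomputed-feature layout (scan and viewpoint IDs, image sizes, and a base64-encoded 36x2048 float32 feature matrix per row); verify the layout against your copy of the file.

```python
# Minimal sketch: read one row of the precomputed feature TSV and decode its features.
# Assumes the usual R2R feature layout (36 views x 2048-d ResNet features, base64-encoded).
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # feature strings are far longer than the default limit
FIELDS = ['scanId', 'viewpointId', 'image_w', 'image_h', 'vfov', 'features']

with open('img_features/ResNet-152-imagenet.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDS)
    row = next(reader)

feats = np.frombuffer(base64.b64decode(row['features']), dtype=np.float32)
print(row['scanId'], row['viewpointId'], feats.reshape(36, 2048).shape)
```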
- Download the R2R dataset and our sampled trajectories for data augmentation:
./tasks/R2R/data/download.sh
- Train the speaker model:
python tasks/R2R/train_speaker.py
- Generate synthetic instructions from the trained speaker model as data augmentation:
# the path prefix to the speaker model (trained in Step 1 above)
export SPEAKER_PATH_PREFIX=tasks/R2R/speaker/snapshots/speaker_teacher_imagenet_mean_pooled_train_iter_20000
python tasks/R2R/selfplay_from_speaker.py \
$SPEAKER_PATH_PREFIX \
tasks/R2R/data/R2R
After this step, R2R_literal_speaker_data_augmentation_paths.json will be generated under tasks/R2R/data/. This JSON file contains synthetic instructions generated by the speaker model on newly sampled trajectories in the train environment (i.e. the speaker-driven data augmentation in our paper).
Alternatively, you can directly download our precomputed speaker-driven data augmentation with ./tasks/R2R/data/download_precomputed_augmentation.sh. Either way, the file can be inspected with the short sketch below.
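The following minimal sketch peeks at the augmentation file. We assume its entries follow the standard R2R trajectory format (scan, path, heading, instructions); check the actual keys in your copy.

```python
# Peek at the speaker-driven augmentation file -- a minimal sketch.
# Assumes entries follow the standard R2R trajectory format; verify the keys in your copy.
import json

with open('tasks/R2R/data/R2R_literal_speaker_data_augmentation_paths.json') as f:
    aug = json.load(f)

print(len(aug), 'augmented trajectories')
print(sorted(aug[0].keys()))        # expected keys include 'scan', 'path', 'instructions'
print(aug[0]['instructions'][0])    # one synthetic instruction produced by the speaker
```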
- Train the follower model on the combination of the original and the augmented training data:
python tasks/R2R/train.py \
--use_pretraining --pretrain_splits train literal_speaker_data_augmentation_paths
The follower will first be trained on the combination of the original train environment and the new literal_speaker_data_augmentation_paths (generated in Step 2 above) for 50000 iterations, and then fine-tuned on the original train environment for 20000 iterations. This step may take a long time. (It took approximately 50 hours using a single GPU on our local machine.)
- All the commands above run on a single GPU. You may choose a specific GPU by setting the CUDA_VISIBLE_DEVICES environment variable (e.g. export CUDA_VISIBLE_DEVICES=1 to use GPU 1).
- You may directly download our trained speaker model and follower model with:
./tasks/R2R/snapshots/release/download_speaker_release.sh # Download speaker
./tasks/R2R/snapshots/release/download_follower_release.sh # Download follower
The scripts above will save the downloaded models under ./tasks/R2R/snapshots/release/. To use these downloaded models, set the speaker and follower path prefixes as follows:
export SPEAKER_PATH_PREFIX=tasks/R2R/snapshots/release/speaker_final_release
export FOLLOWER_PATH_PREFIX=tasks/R2R/snapshots/release/follower_final_release
- Set the path prefixes for the trained speaker and follower model:
# the path prefixes to the trained speaker and follower model
# change these path prefixes if you are using downloaded models.
export SPEAKER_PATH_PREFIX=tasks/R2R/speaker/snapshots/speaker_teacher_imagenet_mean_pooled_train_iter_20000
export FOLLOWER_PATH_PREFIX=tasks/R2R/snapshots/follower_with_pretraining_sample_imagenet_mean_pooled_train_iter_11100
- Generate top-ranking trajectory predictions with pragmatic inference:
# Specify the path prefix to the output evaluation file
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/pragmatics
CUDA_VISIBLE_DEVICES=0 python tasks/R2R/rational_follower.py \
$FOLLOWER_PATH_PREFIX \
$SPEAKER_PATH_PREFIX \
--batch_size 15 --beam_size 40 --state_factored_search \
--use_test_set \
--eval_file $EVAL_FILE_PREFIX
This will generate the prediction files in the directory of EVAL_FILE_PREFIX, and also print the performance on the val_seen and val_unseen splits. (The displayed performance will be zero on the test split, since the test JSON file does not contain ground-truth target locations.) The trajectories predicted by the above script contain only the top-scoring trajectories among all candidate trajectories, ranked with pragmatic inference; a sketch of the reranking idea follows below.
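To illustrate what "ranked with pragmatic inference" means, here is a small sketch of the idea, not the actual implementation in rational_follower.py: candidate routes produced by the follower's search are reranked by a weighted combination of speaker and follower log-probabilities, as in the rational-speaker scoring described in our paper. The names rerank, candidates, and speaker_weight are placeholders for illustration, and the default weight is arbitrary (the paper selects it on validation data).

```python
# Sketch of pragmatic reranking: pick the candidate route that maximizes a weighted
# combination of speaker and follower log-probabilities.  This illustrates the idea
# only; names and the default weight below are placeholders, not the repository's code.
def rerank(candidates, speaker_weight=0.5):
    """candidates: list of (route, follower_logprob, speaker_logprob) tuples."""
    def score(item):
        route, follower_lp, speaker_lp = item
        return speaker_weight * speaker_lp + (1.0 - speaker_weight) * follower_lp
    return max(candidates, key=score)[0]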
- To participate in the Vision-and-Language Navigation Challenge, add the --physical_traversal option to generate physically plausible trajectory predictions with pragmatic inference:
# Specify the path prefix to the output evaluation file
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/pragmatics_physical
CUDA_VISIBLE_DEVICES=0 python tasks/R2R/rational_follower.py \
$FOLLOWER_PATH_PREFIX \
$SPEAKER_PATH_PREFIX \
--batch_size 15 --beam_size 40 --state_factored_search \
--use_test_set --physical_traversal \
--eval_file $EVAL_FILE_PREFIX
This will generate the prediction files in the directory of EVAL_FILE_PREFIX. These prediction files can be submitted to https://evalai.cloudcv.org/web/challenges/challenge-page/97/overview for evaluation.
The major difference with --physical_traversal is that the generated trajectories now contain all states visited by the search algorithm, in the order they are traversed. The agent expands each route one step forward at a time, and then switches to expanding the next route. The details are explained in Appendix E of our paper; a toy sketch of this expansion order follows below.
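The toy sketch below only illustrates the round-robin expansion order described above (the actual search also has to move the agent through the environment graph between routes, so this is not the repository's algorithm):

```python
# Toy illustration of the round-robin expansion order used with --physical_traversal:
# each candidate route advances one step, then the agent switches to the next route,
# and every state reached is recorded in traversal order.
def physical_traversal_order(routes):
    """routes: list of candidate routes, each a list of states (e.g. viewpoint IDs)."""
    visited = []
    iters = [iter(r) for r in routes]
    active = list(range(len(routes)))
    while active:
        next_active = []
        for i in active:
            state = next(iters[i], None)   # take one step along route i
            if state is not None:
                visited.append(state)      # record the state just reached
                next_active.append(i)      # keep expanding this route in the next round
        active = next_active
    return visited

# Two candidate routes are interleaved step by step:
print(physical_traversal_order([['a', 'b', 'c'], ['x', 'y']]))   # ['a', 'x', 'b', 'y', 'c']
```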
This repository is built upon the Matterport3DSimulator codebase.