Commit b162873: update README

ronghanghu committed Sep 17, 2018
1 parent 52285ca, commit b162873
Showing 2 changed files with 78 additions and 155 deletions.
README.md (233 changes: 78 additions, 155 deletions)
If you didn't clone with the `--recursive` flag, then you'll need to manually clone the submodules:
```
git submodule update --init --recursive
```
4. Install the dependencies for the Matterport3D Simulator:
```
sudo apt-get install libopencv-dev python-opencv freeglut3 freeglut3-dev libglm-dev libjsoncpp-dev doxygen libosmesa6-dev libosmesa6 libglew-dev
```
5. Compile the Matterport3D Simulator:
```
mkdir build && cd build
cmake ..
make
cd ../
```
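If the build succeeds, the simulator's Python bindings should be importable. Below is a minimal Python sanity check (ours, not part of the repository); it assumes the compiled `MatterSim` module ends up under `build/`, as in the upstream Matterport3DSimulator, and that you run it from the repository root.
```
# Minimal sanity check (not part of the repo): confirm the MatterSim bindings built.
# Assumes the compiled extension was placed under build/ (an assumption); run from
# the repository root.
import sys

sys.path.append('build')

try:
    import MatterSim  # noqa: F401
    print('MatterSim imported successfully')
except ImportError as err:
    print('MatterSim is not importable; check the build output:', err)
```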

*Note:* This repository is built upon the [Matterport3DSimulator](https://github.com/peteanderson80/Matterport3DSimulator) codebase. Additional details on the Matterport3D Simulator can be found in [`README_Matterport3DSimulator.md`](README_Matterport3DSimulator.md).

## Train and evaluate on the Room-to-Room (R2R) dataset

### Download and preprocess the data

1. Download the precomputed ResNet image features, and extract them into `img_features/`:
```
mkdir -p img_features/
cd img_features/
wget https://storage.googleapis.com/bringmeaspoon/img_features/ResNet-152-imagenet.zip -O ResNet-152-imagenet.zip
unzip ResNet-152-imagenet.zip
cd ..
```
After this step, `img_features/` should contain `ResNet-152-imagenet.tsv`. (Note that you only need to download the features extracted from ImageNet-pretrained ResNet to run the following experiments. Places-pretrained ResNet features or actual images are not required.) A short sketch for sanity-checking the downloaded features is included after this list.

2. Download the R2R dataset and our sampled trajectories for data augmentation:
```
./tasks/R2R/data/download.sh
```
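As referenced in Step 1 above, here is a minimal Python sketch (ours, not part of the repository) for sanity-checking the downloaded image features. It assumes the usual R2R feature format: tab-separated rows with fields `scanId`, `viewpointId`, `image_w`, `image_h`, `vfov`, and `features`, where `features` is a base64-encoded float32 array of 36 views x 2048 dimensions per viewpoint; these field names and shapes are assumptions, not guaranteed by this README.
```
# Minimal sketch (not part of the repo): read a few rows of the feature TSV.
# The field names and the 36 x 2048 feature shape are assumptions based on the
# commonly used R2R image-feature format.
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # rows are long because of the encoded features
FIELDS = ['scanId', 'viewpointId', 'image_w', 'image_h', 'vfov', 'features']

with open('img_features/ResNet-152-imagenet.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDS)
    for i, row in enumerate(reader):
        feats = np.frombuffer(base64.b64decode(row['features']),
                              dtype=np.float32).reshape(36, 2048)
        print(row['scanId'], row['viewpointId'], feats.shape, float(feats.mean()))
        if i >= 2:  # only inspect the first few rows
            break
```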
### Training

1. Train the speaker model:
```
python tasks/R2R/train_speaker.py
```

2. Generate synthetic instructions from the trained speaker model as data augmentation:
```
# the path prefix to the speaker model (trained in Step 1 above)
export SPEAKER_PATH_PREFIX=tasks/R2R/speaker/snapshots/speaker_teacher_imagenet_mean_pooled_train_iter_20000
python tasks/R2R/selfplay_from_speaker.py \
$SPEAKER_PATH_PREFIX \
tasks/R2R/data/R2R
```
After this step, `R2R_literal_speaker_data_augmentation_paths.json` will be generated under `tasks/R2R/data/`. This JSON file contains synthetic instructions generated by the speaker model on sampled new trajectories in the train environment (i.e. the speaker-driven data augmentation in our paper). A short sketch for inspecting this file is included after this list.

Alternatively, you can directly download our precomputed speaker-driven data augmentation with
`./tasks/R2R/data/download_precomputed_augmentation.sh`.

3. Train the follower model on the combination of the original and the augmented training data.
```
python tasks/R2R/train.py \
--use_pretraining --pretrain_splits train literal_speaker_data_augmentation_paths
```
The follower will first be trained on the combination of the original `train` environment and the new `literal_speaker_data_augmentation_paths` (generated in Step 2 above) for 50000 iterations, and then fine-tuned on the original `train` environment for 20000 iterations.

This step may take a long time. (It took approximately 50 hours using a single GPU on our local machine.)
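As mentioned in Step 2 above, the following Python sketch (ours, not part of the repository) peeks at the generated (or downloaded) data-augmentation file. It only assumes the file is a JSON list of instruction/trajectory entries in the style of the standard R2R annotation files; the exact field names may differ.
```
# Minimal sketch (not part of the repo): inspect the speaker-driven augmentation file.
# We only assume it is a JSON list of dict-like entries; field names are not guaranteed.
import json

path = 'tasks/R2R/data/R2R_literal_speaker_data_augmentation_paths.json'
with open(path) as f:
    aug = json.load(f)

print('number of augmented entries:', len(aug))
print('fields of the first entry:', sorted(aug[0].keys()))
```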

Note:
* All the commands above run on a single GPU. You may choose a specific GPU by setting the `CUDA_VISIBLE_DEVICES` environment variable (e.g. `export CUDA_VISIBLE_DEVICES=1` to use GPU 1).
* You may directly download our trained speaker model and follower model with
```
./tasks/R2R/snapshots/release/download_speaker_release.sh # Download speaker
./tasks/R2R/snapshots/release/download_follower_release.sh # Download follower
```

The scripts above will save the downloaded models under `./tasks/R2R/snapshots/release/`. To use these downloaded models, set the speaker and follower path prefixes as follows:
```
export SPEAKER_PATH_PREFIX=tasks/R2R/snapshots/release/speaker_final_release
export FOLLOWER_PATH_PREFIX=tasks/R2R/snapshots/release/follower_final_release
```

### Test

1. Set the path prefixes for the trained speaker and follower model:
```
# the path prefixes to the trained speaker and follower model
# change these path prefixes if you are using downloaded models.
export SPEAKER_PATH_PREFIX=tasks/R2R/speaker/snapshots/speaker_teacher_imagenet_mean_pooled_train_iter_20000
export FOLLOWER_PATH_PREFIX=tasks/R2R/snapshots/follower_with_pretraining_sample_imagenet_mean_pooled_train_iter_11100
```

2. Generate top-ranking trajectory predictions with pragmatic inference:
```
# Specify the path prefix to the output evaluation file
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/pragmatics
CUDA_VISIBLE_DEVICES=0 python tasks/R2R/rational_follower.py \
$FOLLOWER_PATH_PREFIX \
$SPEAKER_PATH_PREFIX \
--batch_size 15 --beam_size 40 --state_factored_search \
--use_test_set \
--eval_file $EVAL_FILE_PREFIX
```
This will generate the prediction files in the directory containing `EVAL_FILE_PREFIX`, and also print the performance on the `val_seen` and `val_unseen` splits.

The prediction files generated by the above script contain only the **top-scoring trajectories** among all candidate trajectories, ranked with pragmatic inference. (A short sketch for inspecting these prediction files is given at the end of this section.)

3. To participate in the [Vision-and-Language Navigation Challenge](https://evalai.cloudcv.org/web/challenges/challenge-page/97/overview), add the `--physical_traversal` option to generate physically-plausible trajectory predictions with pragmatic inference:
```
# Specify the path prefix to the output evaluation file
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/pragmatics_physical
CUDA_VISIBLE_DEVICES=0 python tasks/R2R/rational_follower.py \
$FOLLOWER_PATH_PREFIX \
$SPEAKER_PATH_PREFIX \
--batch_size 15 --beam_size 40 --state_factored_search \
--use_test_set --physical_traversal \
--eval_file $EVAL_FILE_PREFIX
```
This will generate the prediction files in the directory containing `EVAL_FILE_PREFIX`. These prediction files can be submitted to https://evalai.cloudcv.org/web/challenges/challenge-page/97/overview for evaluation.

The major difference with `--physical_traversal` is that the generated trajectories now contain **all states visited by the search algorithm, in the order they are traversed**. The agent expands each route one step forward at a time, and then switches to expanding the next route. The details are explained in Appendix E of [our paper](https://arxiv.org/pdf/1806.02724.pdf).
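The Python sketch below (ours, not part of the repository) shows one way to take a quick look at the prediction files produced in Steps 2 and 3. It assumes the outputs follow the standard R2R submission format, i.e. a JSON list of entries with an `instr_id` and a `trajectory` of `(viewpoint_id, heading, elevation)` steps, and that the output file names start with the chosen `EVAL_FILE_PREFIX`; check the actual names under `tasks/R2R/eval_outputs/`.
```
# Minimal sketch (not part of the repo): summarize the generated prediction files.
# The instr_id/trajectory fields and the file-name pattern are assumptions based
# on the standard R2R submission format.
import glob
import json

for path in sorted(glob.glob('tasks/R2R/eval_outputs/pragmatics*.json')):
    with open(path) as f:
        preds = json.load(f)
    steps = len(preds[0]['trajectory'])
    print('{}: {} trajectories, first has {} steps'.format(path, len(preds), steps))
```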

## Acknowledgements

Empty file added: tasks/R2R/eval_outputs/.keep
