Commit b162873: update README

ronghanghu committed Sep 17, 2018
1 parent 52285ca, commit b162873
Showing 2 changed files with 78 additions and 155 deletions.
README.md (233 changes: 78 additions, 155 deletions)
If you didn't clone with the `--recursive` flag, then you'll need to manually clone the submodules:
```
git submodule update --init --recursive
```
4. Install the dependencies for the Matterport3D Simulator:
```
sudo apt-get install libopencv-dev python-opencv freeglut3 freeglut3-dev libglm-dev libjsoncpp-dev doxygen libosmesa6-dev libosmesa6 libglew-dev
```
5. Compile the Matterport3D Simulator:
```
mkdir build && cd build
cmake ..
make
cd ../
```
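If the build succeeds, the simulator's Python bindings should be importable. Below is a minimal Python sanity check (ours, not part of the repository); it assumes the compiled `MatterSim` module ends up under `build/`, as in the upstream Matterport3DSimulator, and that you run it from the repository root.
```
# Minimal sanity check (not part of the repo): confirm the MatterSim bindings built.
# Assumes the compiled extension was placed under build/ (an assumption); run from
# the repository root.
import sys

sys.path.append('build')

try:
    import MatterSim  # noqa: F401
    print('MatterSim imported successfully')
except ImportError as err:
    print('MatterSim is not importable; check the build output:', err)
```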

*Note:* This repository is built upon the [Matterport3DSimulator](https://github.com/peteanderson80/Matterport3DSimulator) codebase. Additional details on the Matterport3D Simulator can be found in [`README_Matterport3DSimulator.md`](README_Matterport3DSimulator.md).

## Train and evaluate on the Room-to-Room (R2R) dataset

### Download and preprocess the data

1. Download the precomputed ResNet image features, and extract them into `img_features/`:
```
mkdir -p img_features/
cd img_features/
wget https://storage.googleapis.com/bringmeaspoon/img_features/ResNet-152-imagenet.zip -O ResNet-152-imagenet.zip
unzip ResNet-152-imagenet.zip
cd ..
```
After this step, `img_features/` should contain `ResNet-152-imagenet.tsv`. (Note that you only need to download the features extracted from ImageNet-pretrained ResNet to run the following experiments. Places-pretrained ResNet features or actual images are not required.) A short sketch for sanity-checking the downloaded features is included after this list.

2. Download the R2R dataset and our sampled trajectories for data augmentation:
```
./tasks/R2R/data/download.sh
```
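As referenced in Step 1 above, here is a minimal Python sketch (ours, not part of the repository) for sanity-checking the downloaded image features. It assumes the usual R2R feature format: tab-separated rows with fields `scanId`, `viewpointId`, `image_w`, `image_h`, `vfov`, and `features`, where `features` is a base64-encoded float32 array of 36 views x 2048 dimensions per viewpoint; these field names and shapes are assumptions, not guaranteed by this README.
```
# Minimal sketch (not part of the repo): read a few rows of the feature TSV.
# The field names and the 36 x 2048 feature shape are assumptions based on the
# commonly used R2R image-feature format.
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # rows are long because of the encoded features
FIELDS = ['scanId', 'viewpointId', 'image_w', 'image_h', 'vfov', 'features']

with open('img_features/ResNet-152-imagenet.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t', fieldnames=FIELDS)
    for i, row in enumerate(reader):
        feats = np.frombuffer(base64.b64decode(row['features']),
                              dtype=np.float32).reshape(36, 2048)
        print(row['scanId'], row['viewpointId'], feats.shape, float(feats.mean()))
        if i >= 2:  # only inspect the first few rows
            break
```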
### Training

1. Train the speaker model:
```
python tasks/R2R/train_speaker.py
```

2. Generate synthetic instructions from the trained speaker model as data augmentation:
```
# the path prefix to the speaker model (trained in Step 1 above)
export SPEAKER_PATH_PREFIX=tasks/R2R/speaker/snapshots/speaker_teacher_imagenet_mean_pooled_train_iter_20000
python tasks/R2R/selfplay_from_speaker.py \
$SPEAKER_PATH_PREFIX \
tasks/R2R/data/R2R
```
After this step, `R2R_literal_speaker_data_augmentation_paths.json` will be generated under `tasks/R2R/data/`. This JSON file contains synthetic instructions generated by the speaker model on sampled new trajectories in the train environment (i.e. the speaker-driven data augmentation in our paper). A short sketch for inspecting this file is included after this list.

Alternatively, you can directly download our precomputed speaker-driven data augmentation with
`./tasks/R2R/data/download_precomputed_augmentation.sh`.

3. Train the follower model on the combination of the original and the augmented training data.
```
python tasks/R2R/train.py \
--use_pretraining --pretrain_splits train literal_speaker_data_augmentation_paths
```
The follower will first be trained on the combination of the original `train` environment and the new `literal_speaker_data_augmentation_paths` (generated in Step 2 above) for 50000 iterations, and then fine-tuned on the original `train` environment for 20000 iterations.

This step may take a long time. (It took approximately 50 hours using a single GPU on our local machine.)
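As mentioned in Step 2 above, the following Python sketch (ours, not part of the repository) peeks at the generated (or downloaded) data-augmentation file. It only assumes the file is a JSON list of instruction/trajectory entries in the style of the standard R2R annotation files; the exact field names may differ.
```
# Minimal sketch (not part of the repo): inspect the speaker-driven augmentation file.
# We only assume it is a JSON list of dict-like entries; field names are not guaranteed.
import json

path = 'tasks/R2R/data/R2R_literal_speaker_data_augmentation_paths.json'
with open(path) as f:
    aug = json.load(f)

print('number of augmented entries:', len(aug))
print('fields of the first entry:', sorted(aug[0].keys()))
```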

Note:
* All the commands above run on a single GPU. You may choose a specific GPU by setting the `CUDA_VISIBLE_DEVICES` environment variable (e.g. `export CUDA_VISIBLE_DEVICES=1` to use GPU 1).
* You may directly download our trained speaker model and follower model with
```
./tasks/R2R/snapshots/release/download_speaker_release.sh # Download speaker
./tasks/R2R/snapshots/release/download_follower_release.sh # Download follower
```

The scripts above will save the downloaded models under `./tasks/R2R/snapshots/release/`. To use these downloaded models, set the speaker and follower path prefixes as follows:
```
export SPEAKER_PATH_PREFIX=tasks/R2R/snapshots/release/speaker_final_release
export FOLLOWER_PATH_PREFIX=tasks/R2R/snapshots/release/follower_final_release
```

### Test

1. Set the path prefixes for the trained speaker and follower model:
```
# the path prefixes to the trained speaker and follower model
# change these path prefixes if you are using downloaded models.
export SPEAKER_PATH_PREFIX=tasks/R2R/speaker/snapshots/speaker_teacher_imagenet_mean_pooled_train_iter_20000
export FOLLOWER_PATH_PREFIX=tasks/R2R/snapshots/follower_with_pretraining_sample_imagenet_mean_pooled_train_iter_11100
```

2. Generate top-ranking trajectory predictions with pragmatic inference:
```
# Specify the path prefix to the output evaluation file
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/pragmatics
CUDA_VISIBLE_DEVICES=0 python tasks/R2R/rational_follower.py \
$FOLLOWER_PATH_PREFIX \
$SPEAKER_PATH_PREFIX \
--batch_size 15 --beam_size 40 --state_factored_search \
--use_test_set \
--eval_file $EVAL_FILE_PREFIX
```
This will generate the prediction files in the directory containing `EVAL_FILE_PREFIX`, and also print the performance on the `val_seen` and `val_unseen` splits.

The prediction files generated by the above script contain only the **top-scoring trajectories** among all candidate trajectories, ranked with pragmatic inference. (A short sketch for inspecting these prediction files is given at the end of this section.)

3. To participate in the [Vision-and-Language Navigation Challenge](https://evalai.cloudcv.org/web/challenges/challenge-page/97/overview), add the `--physical_traversal` option to generate physically-plausible trajectory predictions with pragmatic inference:
```
# Specify the path prefix to the output evaluation file
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/pragmatics_physical
CUDA_VISIBLE_DEVICES=0 python tasks/R2R/rational_follower.py \
$FOLLOWER_PATH_PREFIX \
$SPEAKER_PATH_PREFIX \
--batch_size 15 --beam_size 40 --state_factored_search \
--use_test_set --physical_traversal \
--eval_file $EVAL_FILE_PREFIX
```
This will generate the prediction files in the directory containing `EVAL_FILE_PREFIX`. These prediction files can be submitted to https://evalai.cloudcv.org/web/challenges/challenge-page/97/overview for evaluation.

The major difference with `--physical_traversal` is that the generated trajectories now contain **all states visited by the search algorithm, in the order they are traversed**. The agent expands each route one step forward at a time, and then switches to expanding the next route. The details are explained in Appendix E of [our paper](https://arxiv.org/pdf/1806.02724.pdf).
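The Python sketch below (ours, not part of the repository) shows one way to take a quick look at the prediction files produced in Steps 2 and 3. It assumes the outputs follow the standard R2R submission format, i.e. a JSON list of entries with an `instr_id` and a `trajectory` of `(viewpoint_id, heading, elevation)` steps, and that the output file names start with the chosen `EVAL_FILE_PREFIX`; check the actual names under `tasks/R2R/eval_outputs/`.
```
# Minimal sketch (not part of the repo): summarize the generated prediction files.
# The instr_id/trajectory fields and the file-name pattern are assumptions based
# on the standard R2R submission format.
import glob
import json

for path in sorted(glob.glob('tasks/R2R/eval_outputs/pragmatics*.json')):
    with open(path) as f:
        preds = json.load(f)
    steps = len(preds[0]['trajectory'])
    print('{}: {} trajectories, first has {} steps'.format(path, len(preds), steps))
```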

## Acknowledgements

Empty file added: tasks/R2R/eval_outputs/.keep
