This is the code for the IEP-Ref model, a module network approach for referring expression problems. See our paper:
@article{liu2019clevr,
author = {Runtao Liu and
Chenxi Liu and
Yutong Bai and
Alan Yuille},
title = {CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions},
journal = {arXiv preprint arXiv:1901.00850},
year = {2019}
}
All code was developed and tested on Ubuntu 16.04 with Python 3.5.
You can set up a virtual environment to run the code like this:
virtualenv -p python3 .env # Create virtual environment
source .env/bin/activate # Activate virtual environment
pip install -r requirements.txt # Install dependencies
echo $PWD > .env/lib/python3.5/site-packages/iep.pth # Add this package to virtual environment (adjust python3.5 to your Python version)
# Work for a while ...
deactivate # Exit virtual environment
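The echo command above works because Python's site module reads every .pth file in site-packages at start-up and appends each existing directory listed in it to sys.path. Because that line hard-codes python3.5, the minimal sketch below looks up the active interpreter's site-packages with sysconfig instead. The helper install_pth is hypothetical (not part of this repository), and the demo runs on temporary directories so it does not touch a real environment.

```python
import os
import site
import sys
import sysconfig
import tempfile


def install_pth(repo_dir, site_dir=None, name="iep.pth"):
    """Write a .pth file pointing at repo_dir into site_dir (hypothetical helper).

    site_dir defaults to the active interpreter's purelib (site-packages)
    directory, so no Python version such as python3.5 is hard-coded.
    """
    if site_dir is None:
        site_dir = sysconfig.get_paths()["purelib"]
    pth_path = os.path.join(site_dir, name)
    with open(pth_path, "w") as f:
        f.write(os.path.abspath(repo_dir) + "\n")
    return pth_path


# Demo on throwaway directories instead of the real site-packages.
repo = tempfile.mkdtemp()
fake_site = tempfile.mkdtemp()
install_pth(repo, fake_site)
# addsitedir processes .pth files the same way interpreter start-up does.
site.addsitedir(fake_site)
print(os.path.abspath(repo) in sys.path)
```

Inside an activated virtual environment, calling install_pth(os.getcwd()) from the repository root would then play the role of the echo line.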
Before you can train any models, you need to download the CLEVR-Ref+ dataset, extract features for the images, and preprocess the referring expressions and programs.
First you need to download and unpack the CLEVR-Ref+ dataset.
For the purpose of this tutorial we assume that all data will be stored in a new directory called data/, with the dataset unpacked into data/clevr_ref+_1.0/ (the path used by all of the commands below).
Extract ResNet-101 features for the CLEVR-Ref+ train and val images with the following commands:
python scripts/extract_features.py \
--input_image_dir data/clevr_ref+_1.0/images/train/ \
--output_h5_file data/train_features.h5 \
--image_height 320 \
--image_width 320 \
--batch_size 32
python scripts/extract_features.py \
--input_image_dir data/clevr_ref+_1.0/images/val/ \
--output_h5_file data/val_features.h5 \
--image_height 320 \
--image_width 320 \
--batch_size 32
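Training later passes --feature_dim 1024,20,20, which has to agree with the features extracted here. Assuming extract_features.py keeps ResNet-101 activations after the third residual stage (layer3: 1024 channels, total stride 16), as the original IEP feature extractor does by default, a 320x320 input yields a 20x20 grid. The sketch below (helper names are hypothetical) derives this from the standard convolution output-size formula; the byte count additionally assumes the features are stored as float32.

```python
def conv_out(size, kernel, stride, padding):
    # Standard output-size formula for a convolution or pooling layer.
    return (size + 2 * padding - kernel) // stride + 1


def resnet_layer3_shape(height, width, channels=1024):
    """Feature shape after ResNet-101 layer3 (assumed extraction point)."""
    def side(s):
        s = conv_out(s, 7, 2, 3)  # conv1: 7x7, stride 2, padding 3
        s = conv_out(s, 3, 2, 1)  # max-pool: 3x3, stride 2, padding 1
        # layer1 keeps the resolution; layer2 and layer3 each halve it
        s = conv_out(s, 3, 2, 1)  # first block of layer2 (stride 2)
        s = conv_out(s, 3, 2, 1)  # first block of layer3 (stride 2)
        return s
    return (channels, side(height), side(width))


shape = resnet_layer3_shape(320, 320)
print(shape)  # (1024, 20, 20)

# Storage per image, assuming float32 (4 bytes per value).
bytes_per_image = shape[0] * shape[1] * shape[2] * 4
print(bytes_per_image)  # 1638400
```

Under the same assumption, changing --image_height/--image_width also requires changing --feature_dim; for example, 224x224 inputs would give 1024,14,14.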
Preprocess the referring expressions and programs for the CLEVR-Ref+ train and val sets with the following commands:
python scripts/preprocess_refexps.py \
--input_refexps_json data/clevr_ref+_1.0/refexps/clevr_ref+_train_refexps.json \
--input_scenes_json data/clevr_ref+_1.0/scenes/clevr_ref+_train_scenes.json \
--num_examples -1 \
--output_h5_file data/train_refexps.h5 \
--height 320 \
--width 320 \
--output_vocab_json data/vocab.json
python scripts/preprocess_refexps.py \
--input_refexps_json data/clevr_ref+_1.0/refexps/clevr_ref+_val_refexps.json \
--input_scenes_json data/clevr_ref+_1.0/scenes/clevr_ref+_val_scenes.json \
--num_examples -1 \
--output_h5_file data/val_refexps.h5 \
--height 320 \
--width 320 \
--input_vocab_json data/vocab.json
When preprocessing the referring expressions, we create a file vocab.json that stores the mapping between tokens and indices for referring expressions and programs. We create this vocabulary when preprocessing the training referring expressions (--output_vocab_json), then reuse the same vocabulary file for the val referring expressions (--input_vocab_json).
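Reusing the training vocabulary keeps token indices identical across splits, so val words never seen in training must fall back to an "unknown" index. The sketch below illustrates that idea for the referring-expression side only. The special tokens, the tokenization rule and the JSON key refexp_token_to_idx are all assumptions for illustration, not the exact format written by preprocess_refexps.py.

```python
import json
import re

# Assumed special tokens; the real vocab.json may use different ones.
SPECIAL_TOKENS = ["<NULL>", "<START>", "<END>", "<UNK>"]


def tokenize(text):
    # Assumed rule: lowercase, keep alphanumeric runs, drop punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts, min_count=1):
    counts = {}
    for text in texts:
        for tok in tokenize(text):
            counts[tok] = counts.get(tok, 0) + 1
    token_to_idx = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok in sorted(counts):  # sorted for a deterministic index order
        if counts[tok] >= min_count:
            token_to_idx[tok] = len(token_to_idx)
    return token_to_idx


def encode(text, token_to_idx):
    unk = token_to_idx["<UNK>"]
    ids = [token_to_idx.get(tok, unk) for tok in tokenize(text)]
    return [token_to_idx["<START>"]] + ids + [token_to_idx["<END>"]]


# Build from training expressions (like --output_vocab_json) ...
train = ["The big red cube.", "Any cylinder left of the cube."]
vocab = build_vocab(train)
saved = json.dumps({"refexp_token_to_idx": vocab})  # hypothetical key name

# ... then reload and reuse it for val (like --input_vocab_json).
reloaded = json.loads(saved)["refexp_token_to_idx"]
print(encode("the small red sphere", reloaded))  # [1, 11, 3, 10, 3, 2]
```

Here "small" and "sphere" do not occur in the toy training set, so both map to the <UNK> index 3.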
Models are trained through a three-step procedure:
- Train the program generator using a small number of ground-truth programs
- Train the execution engine using predicted outputs from the trained program generator
- Jointly fine-tune both the program generator and the execution engine without any ground-truth programs
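The three steps depend on each other only through checkpoints: step 2 starts from the program generator trained in step 1, and step 3 starts from both trained modules. The sketch below (function names are hypothetical) builds the train_model.py flags for each step from the --checkpoint_path values used in the commands that follow. It assumes the "<checkpoint_path>_<iteration>" file naming suggested by the path passed to --program_generator_start_from in step 2; data-path, batch-size and learning-rate flags are omitted for brevity.

```python
import shlex


def final_checkpoint(checkpoint_path, num_iterations):
    # Assumed naming: --checkpoint_path X.pt with --num_iterations N ends as X.pt_N.
    return "{}_{}".format(checkpoint_path, num_iterations)


def pipeline(pg_ckpt, pg_iters, ee_ckpt, ee_iters, joint_ckpt):
    """Return the core train_model.py flags for the three chained steps."""
    pg_final = final_checkpoint(pg_ckpt, pg_iters)
    ee_final = final_checkpoint(ee_ckpt, ee_iters)
    step1 = ["--model_type", "PG", "--num_iterations", str(pg_iters),
             "--checkpoint_path", pg_ckpt]
    step2 = ["--model_type", "PG+EE", "--program_generator_start_from", pg_final,
             "--train_program_generator", "0", "--train_execution_engine", "1",
             "--num_iterations", str(ee_iters), "--checkpoint_path", ee_ckpt]
    step3 = ["--model_type", "PG+EE", "--program_generator_start_from", pg_final,
             "--execution_engine_start_from", ee_final,
             "--train_program_generator", "1", "--train_execution_engine", "1",
             "--checkpoint_path", joint_ckpt]
    return [step1, step2, step3]


steps = pipeline("data/run_PG_ref_18k/program_generator.pt", 32000,
                 "data/run_fixedPG+EE_ref/execution_engine.pt", 450000,
                 "data/run_jointPG+EE_ref/joint_pg_ee.pt")
for args in steps:
    print("python scripts/train_model.py " + " ".join(shlex.quote(a) for a in args))
```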
All training code runs on GPU and assumes that CUDA and cuDNN have already been installed.
In this step we use a small number of ground-truth programs (18,000 examples, set by --num_train_samples) to train the program generator:
mkdir data/run_PG_ref_18k
python scripts/train_model.py \
--model_type PG \
--num_iterations 32000 \
--num_train_samples 18000 \
--checkpoint_every 1000 \
--checkpoint_path data/run_PG_ref_18k/program_generator.pt \
--batch_size 64 \
--train_refexp_h5 data/train_refexps.h5 \
--train_features_h5 data/train_features.h5 \
--val_refexp_h5 data/val_refexps.h5 \
--val_features_h5 data/val_features.h5 \
--vocab_json data/vocab.json
In this step we train the execution engine on programs predicted by the trained program generator, which is kept fixed (--train_program_generator 0):
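To put "a small number of ground-truth programs" in perspective: with --num_train_samples 18000, --batch_size 64 and --num_iterations 32000, the program generator passes over its 18,000 examples roughly 114 times. A one-line check, assuming each iteration consumes exactly one minibatch drawn from those samples:

```python
def epochs(num_iterations, batch_size, num_samples):
    # Each iteration is assumed to consume one minibatch of batch_size examples.
    return num_iterations * batch_size / num_samples


print(round(epochs(32000, 64, 18000), 1))  # 113.8
```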
mkdir data/run_fixedPG+EE_ref
python scripts/train_model.py \
--program_generator_start_from ./data/run_PG_ref_18k/program_generator.pt_32000 \
--train_execution_engine 1 \
--train_program_generator 0 \
--model_type PG+EE \
--num_iterations 450000 \
--learning_rate 1e-4 \
--checkpoint_path data/run_fixedPG+EE_ref/execution_engine.pt \
--checkpoint_every 5000 \
--train_refexp_h5 data/train_refexps.h5 \
--train_features_h5 data/train_features.h5 \
--val_refexp_h5 data/val_refexps.h5 \
--val_features_h5 data/val_features.h5 \
--vocab_json data/vocab.json \
--batch_size 48 \
--feature_dim 1024,20,20
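Steps 2 and 3 load specific checkpoints by iteration number. When a run is interrupted, a small helper such as the hypothetical latest_checkpoint below can locate the newest file. It assumes the "<checkpoint_path>_<iteration>" naming suggested by the paths used in this README, and compares iteration numbers numerically, because a plain string sort would rank _5000 above _450000.

```python
import os
import re
import tempfile


def latest_checkpoint(checkpoint_path):
    """Return the highest-iteration file named '<checkpoint_path>_<N>', or None.

    The '<path>_<iteration>' naming is an assumption inferred from this README.
    """
    directory, base = os.path.split(checkpoint_path)
    pattern = re.compile(re.escape(base) + r"_(\d+)$")
    best, best_iter = None, -1
    for fname in os.listdir(directory or "."):
        m = pattern.match(fname)
        if m and int(m.group(1)) > best_iter:  # numeric, not lexicographic
            best, best_iter = os.path.join(directory, fname), int(m.group(1))
    return best


# Demo with empty placeholder files in a temporary directory.
run_dir = tempfile.mkdtemp()
for it in (5000, 10000, 450000):
    open(os.path.join(run_dir, "execution_engine.pt_%d" % it), "w").close()
print(latest_checkpoint(os.path.join(run_dir, "execution_engine.pt")))
```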
Alternatively, you can train the execution engine using the ground-truth programs:
mkdir data/run_gt_EE
python scripts/train_model.py \
--model_type EE \
--num_iterations 450000 \
--learning_rate 1e-4 \
--checkpoint_path data/run_gt_EE/execution_engine.pt \
--checkpoint_every 5000 \
--train_refexp_h5 data/train_refexps.h5 \
--train_features_h5 data/train_features.h5 \
--val_refexp_h5 data/val_refexps.h5 \
--val_features_h5 data/val_features.h5 \
--vocab_json data/vocab.json \
--batch_size 48 \
--feature_dim 1024,20,20
In this step we jointly train the program generator and execution engine using REINFORCE:
mkdir data/run_jointPG+EE_ref
python scripts/train_model.py \
--program_generator_start_from data/run_PG_ref_18k/program_generator.pt_32000 \
--execution_engine_start_from data/run_fixedPG+EE_ref/execution_engine.pt_450000 \
--train_execution_engine 1 \
--train_program_generator 1 \
--model_type PG+EE \
--num_iterations 200000 \
--learning_rate 5e-5 \
--checkpoint_path data/run_jointPG+EE_ref/joint_pg_ee.pt \
--checkpoint_every 5000 \
--train_refexp_h5 data/train_refexps.h5 \
--train_features_h5 data/train_features.h5 \
--val_refexp_h5 data/val_refexps.h5 \
--val_features_h5 data/val_features.h5 \
--vocab_json data/vocab.json \
--batch_size 48 \
--feature_dim 1024,20,20
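Sampling discrete program tokens is not differentiable, which is why this step uses REINFORCE: each sampled choice's log-probability is weighted by (reward - baseline). The minimal sketch below shows that update for a single categorical choice with a moving-average baseline, a common variance-reduction choice. How IEP-Ref computes its reward from the execution engine's output is not described in this README, so the reward is simply taken as a given scalar; all names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MovingAverageBaseline:
    """Exponential moving average of past rewards."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.value = 0.0

    def update(self, reward):
        self.value = self.decay * self.value + (1 - self.decay) * reward


def reinforce_grad(logits, action, reward, baseline):
    """Gradient of (reward - baseline) * log softmax(logits)[action] w.r.t. logits."""
    probs = softmax(logits)
    advantage = reward - baseline.value
    # d/dlogit_i log p(action) = 1[i == action] - p_i
    return [advantage * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


# One hypothetical decision between two tokens, rewarded with 1.0.
baseline = MovingAverageBaseline(decay=0.9)
grad = reinforce_grad([0.0, 0.0], action=0, reward=1.0, baseline=baseline)
baseline.update(1.0)  # update after computing the advantage
print(grad, baseline.value)
```

Following this gradient raises the probability of the sampled token when its reward beats the baseline and lowers it otherwise; the baseline shifts the rewards without changing the expected gradient.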
You can use the run_model.py script to test your model on the entire validation set.
To test the model using the programs predicted by the program generator:
python scripts/run_model.py \
--program_generator /path/to/pg.pt \
--execution_engine /path/to/ee.pt \
--input_refexp_h5 data/val_refexps.h5 \
--input_features_h5 data/val_features.h5 \
--batch_size 24 \
--result_output_path ./data/result.json
To test the model using the ground-truth programs:
python scripts/run_model.py \
--program_generator /path/to/pg.pt \
--execution_engine /path/to/ee.pt \
--input_refexp_h5 data/val_refexps.h5 \
--input_features_h5 data/val_features.h5 \
--batch_size 24 \
--result_output_path ./data/result.json \
--use_gt_programs 1