AmbigQA: Answering Ambiguous Open-domain Questions

AmbigQA Baseline Models (Reproduction)

This repo contains multiple models for open-domain question answering. This code is based on the original implementation and uses PyTorch and HuggingFace Transformers.

This repository builds on the original implementation accompanying "Sewon Min, Julian Michael, Hannaneh Hajishirzi, Luke Zettlemoyer. AmbigQA: Answering Ambiguous Open-domain Questions. 2020". Please refer to their repository and website for more information on the AmbigQA task and the AmbigNQ dataset, and cite their paper if you find them useful.

  1. Installation
  2. Download data
  3. Instructions for Training & Testing
  4. Results
  5. Pretrained model checkpoint
  6. Usage examples


Installation

Tested with Python 3.6.12. Throughout, $ indicates a bash command.

$ pip install torch==1.1.0
$ pip install git+
$ pip install wget

Download data

Let {dpr_data_dir} be a directory to save data (can be replaced with a directory of your choosing) and let $ indicate bash commands.

$ mkdir {dpr_data_dir}
$ python3 --resource data.wikipedia_split.psgs_w100 --output_dir {dpr_data_dir}
$ python3 --resource data.wikipedia_split.psgs_w100_20200201 --output_dir {dpr_data_dir}
$ python3 --resource data.nqopen --output_dir data
# python3 --resource data.gold_passages_info.nq_train --output_dir data
# python3 --resource data.ambigqa --output_dir data

DPR Retrieval

For training DPR retrieval, please refer to the original implementation. This code takes a checkpoint from the original implementation and runs inference.

Step 1: Download the DPR retrieval checkpoint provided by the original DPR implementation.

$ python3 --resource checkpoint.retriever.multiset.bert-base-encoder --output_dir {dpr_data_dir}

Step 2: Run inference to obtain passage vectors. Note: if you are using a checkpoint, there is no need to run this section, and you may skip to DPR Reader (Span Selection Model).

$ for i in 0 1 2 3 4 5 6 7 8 9 ; do  # for parallelization
    python3 --do_predict --bert_name bert-base-uncased --output_dir out/dpr --dpr_data_dir {dpr_data_dir} --task dpr --predict_batch_size 3200 --db_index $i ;
  done
  • --predict_batch_size of 3200 works well for one 32GB GPU.
  • --verbose prints a progress bar.
  • This script tokenizes the passages in Wikipedia, which takes time. If you want to pre-tokenize first and launch the job on GPUs afterward, (1) run the above command with --do_prepro_only, and then (2) re-run it without --do_prepro_only.

Each run takes around 1.5 hours with one 32GB GPU.

Step 3: Run inference to obtain question vectors and save the retrieval predictions.

$ python3 --bert_name bert-base-uncased --output_dir out/dpr --dpr_data_dir data --do_predict --task dpr --predict_batch_size 3200 --predict_file data/nqopen/{train|dev|test}.json

This script will print out recall rate and save the retrieval results as out/dpr/{train|dev|test}_predictions.json.

Tip 1: The first run, regardless of the data split, creates the DPR index and saves it so that later runs can reuse it. If you do not want to create the DPR index multiple times, run on one data split first and the others afterward. If you have the resources to run them in parallel, it may save time to run all of them at once.

Tip 2: If you do not need the recall rate, you can specify --skip_db_load to save time. The script will then report a recall of 0, but the prediction file will be saved correctly.
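
As a rough illustration of the recall rate this step reports, the sketch below computes recall@k as the share of questions for which at least one of the top-k retrieved passages contains a gold answer. The function name, the input layout (lists of passage strings and answer strings), and the case-insensitive substring match are all assumptions made for illustration; the script's own matching logic may differ.

```python
from typing import List


def recall_at_k(retrieved: List[List[str]], answers: List[List[str]], k: int) -> float:
    """Fraction of questions whose top-k passages contain any gold answer.

    Hypothetical helper: substring matching is a simplification.
    """
    if not retrieved:
        return 0.0
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top_k = [p.lower() for p in passages[:k]]
        if any(g.lower() in p for g in golds for p in top_k):
            hits += 1
    return hits / len(retrieved)


# Toy data, made up for illustration only.
retrieved = [
    ["Paris is the capital of France.", "Lyon is a city in France."],
    ["The Nile flows through Egypt.", "The Amazon is in South America."],
]
answers = [["Paris"], ["Amazon"]]

print(recall_at_k(retrieved, answers, k=1))  # 0.5
print(recall_at_k(retrieved, answers, k=2))  # 1.0
```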

DPR Reader (Span Selection Model)

Note: if you are using a checkpoint, there is no need to run this section, and you may skip to SpanSeqGen (BART Reader).

For training on NQ-open, run

$ python3 --do_train --task qa --output_dir out/nq-span-selection \
    --dpr_data_dir {dpr_data_dir} \
    --train_file data/nqopen/train.json \
    --predict_file data/nqopen/dev.json \
    --bert_name {bert-base-uncased|bert-large-uncased} \
    --train_batch_size 32 --train_M 32 --predict_batch_size 128 \
    --eval_period 2000 --wait_step 10
  • This script saves the preprocessed input data once it is created so that later runs can reload it. You might want to preprocess the data before launching a job on GPUs.
  • train_batch_size is the number of questions per batch, and train_M is the number of passages per question. The number of (question, passage) pairs per batch is therefore train_batch_size * train_M, which determines GPU memory usage. With one 32GB GPU and bert-base-uncased, you can use a train_batch_size * train_M of 128. The command above uses 32 * 32 = 1024, which matches the eight-GPU setting listed under Hyperparameter details / tuning.
  • eval_period is an interval to test on the dev data. The script will only save the best checkpoint based on the dev data. If you prefer, you can specify skip_inference to skip inference on the dev data and save all checkpoints. You can then run the inference script (described next) on the dev data using every checkpoint, and choose the best checkpoint.
  • wait_step is the number of steps to wait since the best checkpoint, until the training is finished.
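
The interaction between eval_period and wait_step can be pictured with the patience-style loop below. It is a minimal sketch, not the training script itself: the function name and the list of dev scores are hypothetical, and it assumes the wait counter advances once per dev evaluation without improvement, which may not match how the script counts.

```python
from typing import List, Tuple


def simulate_training(dev_scores: List[float], eval_period: int, wait_step: int) -> Tuple[int, float]:
    """Return (step of best checkpoint, best dev score) under early stopping.

    dev_scores[i] is the (hypothetical) dev score at step (i + 1) * eval_period.
    """
    best_score = float("-inf")
    best_step = 0
    waited = 0
    for i, score in enumerate(dev_scores):
        step = (i + 1) * eval_period
        if score > best_score:
            # New best: this is the only checkpoint that would be kept.
            best_score, best_step, waited = score, step, 0
        else:
            waited += 1
            if waited >= wait_step:
                break  # no improvement for wait_step evaluations in a row
    return best_step, best_score


# Made-up dev scores, evaluated every 2000 steps.
scores = [30.1, 35.4, 38.0, 37.5, 37.9, 37.2]
print(simulate_training(scores, eval_period=2000, wait_step=2))  # (6000, 38.0)
```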

When training is done, run the following command for prediction.

$ python3 --do_predict --task qa --output_dir out/nq-span-selection \
    --dpr_data_dir {dpr_data_dir} \
    --predict_file data/nqopen/{dev|test}.json \
    --bert_name {bert-base-uncased|bert-large-uncased} \
    --predict_batch_size 32

This command runs predictions using out/nq-span-selection/ by default. If you want to run predictions using another checkpoint, please specify its path by --checkpoint.

SpanSeqGen (BART Reader)

You may train the SpanSeqGen model on NQ-open (as done in the original paper) or on SQuAD (new in our reproduction). Note that this model differs from the BART closed-book QA model (implemented here), because it reads DPR-retrieved passages as input.

Train on NQ-open

Note: if you are using a checkpoint, there is no need to run the first two code segments since the passages have already been selected. You may simply run the third code segment (though please make sure that the checkpoint is located in out/nq-span-selection).

First, tokenize the passages.

$ for i in 0 1 2 3 4 5 6 7 8 9 ; do  # for parallelization
    python3 --bert_name bart-large --output_dir out/dpr --dpr_data_dir {dpr_data_dir} --do_predict --do_prepro_only --task dpr --predict_batch_size 3200 --db_index $i ;
  done

Then, save passage selection from the trained DPR reader:

$ python3 --do_predict --task qa --output_dir out/nq-span-selection \
    --dpr_data_dir {dpr_data_dir} \
    --predict_file data/nqopen/{train|dev|test}.json \
    --bert_name {bert-base-uncased|bert-large-uncased} \
    --predict_batch_size 32 --save_psg_sel_only

Now, train a model on NQ-open by:

$ python3 --do_train --task qa --output_dir out/nq-span-seq-gen \
    --dpr_data_dir {dpr_data_dir} \
    --train_file data/nqopen/train.json \
    --predict_file data/nqopen/dev.json \
    --psg_sel_dir out/nq-span-selection \
    --bert_name bart-large \
    --discard_not_found_answers \
    --train_batch_size 2 --predict_batch_size 2 \
    --eval_period 2000 --wait_step 10 --max_input_length 700
  • --max_input_length is the maximum length of the input. Any input longer than this number will be truncated. The original authors used 1024 for this value but we suggest using 700 if you are using a smaller GPU (e.g. 12 GB).

  • --do_train specifies that we are training the model. To evaluate your model on NQ-open, replace this command line argument with --do_predict, as shown in the box below:

$ python3 --do_predict --task qa --output_dir out/nq-span-seq-gen \
    --dpr_data_dir {dpr_data_dir} \
    --train_file data/nqopen/train.json \
    --predict_file data/nqopen/dev.json \
    --psg_sel_dir out/nq-span-selection \
    --bert_name bart-large \
    --discard_not_found_answers \
    --train_batch_size 2 --predict_batch_size 2 \
    --eval_period 2000 --wait_step 10 --max_input_length 700

Train on SQuAD

First, preprocess the dataset into the correct format:

$ python3 --output_dir data/ \
    --dpr_data_dir {dpr_data_dir} \
    --dpr_dir out/dpr/ \
    --resource squad

Now you may train a model on SQuAD:

$ python3 --do_train --task qa --output_dir out/squad-span-seq-gen \
    --dpr_data_dir {dpr_data_dir} \
    --train_file data/squad/train.json \
    --predict_file data/squad/dev.json \
    --psg_sel_dir out/nq-span-selection \
    --bert_name bart-large \
    --discard_not_found_answers \
    --train_batch_size 2 \
    --predict_batch_size 2 \
    --eval_period 2000 --wait_step 10 --train_on_squad --max_input_length 150

To evaluate your model on SQuAD, replace the --do_train command line argument with --do_predict, as shown in the box below:

$ python3 --do_predict --task qa --output_dir out/squad-span-seq-gen \
    --dpr_data_dir {dpr_data_dir} \
    --train_file data/squad/train.json \
    --predict_file data/squad/dev.json \
    --psg_sel_dir out/nq-span-selection \
    --bert_name bart-large \
    --discard_not_found_answers \
    --train_batch_size 2 --predict_batch_size 2 \
    --eval_period 2000 --wait_step 10 --train_on_squad --max_input_length 150

Finetuning on AmbigQA

To experiment on AmbigQA, you can simply repeat the NQ-open process with two differences: (i) specify --ambigqa and --wiki_2020 in several places, and (ii) initialize weights from models trained on NQ-open. Step-by-step instructions are as follows.

First, make DPR retrieval predictions using Wikipedia 2020. You can do so by simply repeating Step 2 and Step 3 of DPR Retrieval with --wiki_2020 specified.

Note: if you are using a checkpoint, there is no need to run the next three code segments. You may skip directly to the fourth code segment to fine tune on AmbigNQ.

$ for i in 0 1 2 3 4 5 6 7 8 9 ; do  # for parallelization
    python3 --do_predict --bert_name bert-base-uncased --output_dir out/dpr --dpr_data_dir {dpr_data_dir} --task dpr --predict_batch_size 3200 --db_index $i --wiki_2020 ;
  done
$ python3 --do_predict --task dpr --output_dir out/dpr \
    --dpr_data_dir {dpr_data_dir} \
    --predict_file data/nqopen/{train|dev|test}.json \
    --bert_name bert-base-uncased \
    --predict_batch_size 3200 --wiki_2020

To fine-tune the DPR span selection model on AmbigQA, run a training command similar to the NQ training command, but with --ambigqa and --wiki_2020 specified. We also use a smaller eval_period because the dataset is smaller.

$ python3 --do_train --task qa --output_dir out/ambignq-span-selection \
    --dpr_data_dir {dpr_data_dir} \
    --train_file data/ambigqa/train_light.json \
    --predict_file data/ambigqa/dev_light.json \
    --bert_name {bert-base-uncased|bert-large-uncased} \
    --train_batch_size 32 --train_M 32 --predict_batch_size 32 \
    --eval_period 500 --wait_step 10 --topk_answer 3 --ambigqa --wiki_2020

In order to fine-tune SpanSeqGen on AmbigQA, first run the inference script over DPR to get highly ranked passages, just like we did on NQ.

$ python3 --do_predict --task qa --output_dir out/nq-span-selection \
    --dpr_data_dir {dpr_data_dir} \
    --predict_file data/nqopen/{train|dev|test}.json \
    --bert_name {bert-base-uncased|bert-large-uncased} \
    --predict_batch_size 32 --save_psg_sel_only --wiki_2020

Next, train SpanSeqGen on AmbigNQ via the following command, which specifies --ambigqa, --wiki_2020 and --max_answer_length 25.

$ python3 --do_train --task qa --output_dir out/ambignq-span-seq-gen \
    --dpr_data_dir {dpr_data_dir} \
    --train_file data/ambigqa/train_light.json \
    --predict_file data/ambigqa/dev_light.json \
    --psg_sel_dir out/nq-span-selection \
    --bert_name bart-large \
    --discard_not_found_answers \
    --train_batch_size 2 --predict_batch_size 2 \
    --eval_period 500 --wait_step 10 --ambigqa --wiki_2020 \
    --max_answer_length 25 --max_input_length 700

To evaluate your model on AmbigNQ, simply replace --do_train in the previous command with --do_predict as shown below:

$ python3 --do_predict --task qa --output_dir out/ambignq-span-seq-gen \
    --dpr_data_dir {dpr_data_dir} \
    --train_file data/ambigqa/train_light.json \
    --predict_file data/ambigqa/dev_light.json \
    --psg_sel_dir out/nq-span-selection \
    --bert_name bart-large \
    --discard_not_found_answers \
    --train_batch_size 2 --predict_batch_size 2 \
    --eval_period 500 --wait_step 10 --ambigqa --wiki_2020 \
    --max_answer_length 25 --max_input_length 700

Hyperparameter details / tuning

On NQ-open: For BERT-base, we use train_batch_size=32, train_M=32 (w/ eight 32GB GPUs). For BERT-large, we use train_batch_size=8, train_M=16 (w/ four 32GB GPUs). For BART, we use train_batch_size=24 (w/ four 32GB GPUs). For all other hyperparameters, we use the defaults.

On AmbigQA: We use train_batch_size=8 for BERT-base and train_batch_size=24 for BART. We use learning_rate=5e-6 for both.
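
The NQ-open settings above can be compared through the number of (question, passage) pairs each GPU handles per batch. The helper below is a hypothetical back-of-the-envelope calculation, not part of the codebase; it assumes the batch is split evenly across GPUs.

```python
def pairs_per_gpu(train_batch_size: int, train_m: int, num_gpus: int) -> float:
    """(question, passage) pairs per GPU per batch, assuming an even split."""
    return train_batch_size * train_m / num_gpus


# Settings quoted in this README: (train_batch_size, train_M, number of 32GB GPUs).
settings = {
    "BERT-base": (32, 32, 8),
    "BERT-large": (8, 16, 4),
}
for name, (batch_size, m, gpus) in settings.items():
    print(name, batch_size * m, pairs_per_gpu(batch_size, m, gpus))
# BERT-base 1024 128.0
# BERT-large 128 32.0
```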

To explore the impact of hyperparameters such as beam size on inference time, simply run the included bash script as shown below. It tries several values for beam size (1, 2, 6, 10, and 12), length penalty (1, 3, 5, and 10), and no repeat ngram (0, 1, 2, 3). To try different values, please edit the beams, penaltys, and ngrams variables in

$ ./
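
To get a sense of how many runs the sweep involves, the grid described above can be enumerated as follows. This is a sketch in Python, not the bash script itself; the list names mirror the beams, penaltys, and ngrams variables mentioned above, and the dictionary keys are illustrative labels rather than confirmed command-line flags.

```python
from itertools import product

beams = [1, 2, 6, 10, 12]
penaltys = [1, 3, 5, 10]  # spelling follows the variable name used in the script
ngrams = [0, 1, 2, 3]

# One entry per (beam size, length penalty, no-repeat n-gram) combination.
grid = [
    {"beam_size": b, "length_penalty": p, "no_repeat_ngram": n}
    for b, p, n in product(beams, penaltys, ngrams)
]

print(len(grid))  # 80
print(grid[0])    # {'beam_size': 1, 'length_penalty': 1, 'no_repeat_ngram': 0}
```

With the default values, the sweep therefore covers 5 * 4 * 4 = 80 configurations, which is worth keeping in mind when budgeting inference time.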


Results

| Model | NQ-open (dev) | NQ-open (test) | AmbigQA zero-shot (dev) | AmbigQA zero-shot (test) | AmbigQA (dev) | AmbigQA (test) |
|---|---|---|---|---|---|---|
| DPR (original implementation) | 39.8 | 41.5 | 35.2/26.5 | 30.1/23.2 | 37.1/28.4 | 32.3/24.8 |
| DPR (this code) | 40.6 | 41.6 | 35.2/23.9 | 29.9/21.4 | 36.8/25.8 | 33.3/23.4 |
| DPR (this code) w/ BERT-large | 43.2 | 44.3 | - | - | - | - |
| SpanSeqGen (reported) | 42.0 | 42.2 | 36.4/24.8 | 30.8/20.7 | 39.7/29.3 | 33.5/24.5 |
| SpanSeqGen (this code) | 43.1 | 45.0 | 37.4/26.1 | 33.2/22.6 | 40.3/29.2 | 35.5/25.8 |

Two numbers on AmbigQA indicate F1 score on all questions and F1 score on questions with multiple QA pairs only.
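
To make the two AmbigQA numbers concrete, the sketch below computes a set-level F1 between predicted and gold answers, then averages it once over all questions and once over multi-answer questions only. It is a simplified illustration: the function name is hypothetical, matching is plain lowercase exact match, and the official evaluation may normalize and match answers differently.

```python
from typing import List


def answer_set_f1(predicted: List[str], gold: List[str]) -> float:
    """F1 between a predicted answer set and a gold answer set (exact match)."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up (predicted, gold) pairs for illustration.
examples = [
    (["1994"], ["1994"]),                  # single-answer question
    (["1994", "2001"], ["1994", "1998"]),  # multi-answer question
]
all_f1 = [answer_set_f1(p, g) for p, g in examples]
multi_f1 = [answer_set_f1(p, g) for p, g in examples if len(g) > 1]

print(sum(all_f1) / len(all_f1))      # 0.75
print(sum(multi_f1) / len(multi_f1))  # 0.5
```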

By default, the models are based on BERT-base and BART-large.

Note (as of 07/2020): The numbers are slightly different from those reported in the paper, because the numbers in the paper are based on experiments with fairseq. We re-implemented the models with Huggingface Transformers and were able to obtain similar or better numbers. We will update the numbers in the next version of the paper.

Note: There happen to be two versions of NQ answers that differ marginally in tokenization (e.g. July 15 , 2020 vs. July 15, 2020 or 2019 - 2020 vs. 2019--2020). Research papers outside Google (#1, #2, #3, #4, #5, #6, #7, #8) have been using this version, and in June 2020 the original NQ/NQ-open authors released the original version that has been used in research papers from Google (#1, #2, #3). We verified that the performance differences are marginal when applying simple postprocessing (e.g. text.replace(" - ", "-").replace(" : ", ":")). The numbers reported here, as well as the code, follow Google's original version. Compared to the previous version, the performance difference is 40.6 (original) vs. 40.3 (previous) vs. 40.7 (union of two) on the dev set and 41.6 (original) vs. 41.7 (previous) vs. 41.8 (union of two) on the test set. Nonetheless, we advise using the original version provided by Google in the future.
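
The postprocessing quoted in the note can be wrapped in a small function, as sketched below. The function name is hypothetical; the body is exactly the replace chain from the note, so it only collapses spaces around hyphens and colons.

```python
def normalize_answer_spacing(text: str) -> str:
    """Apply the simple postprocessing quoted in the note above."""
    return text.replace(" - ", "-").replace(" : ", ":")


print(normalize_answer_spacing("2019 - 2020"))     # 2019-2020
print(normalize_answer_spacing("10 : 30"))         # 10:30
print(normalize_answer_spacing("July 15 , 2020"))  # July 15 , 2020 (commas are not touched)
```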

Results with fewer resources

The readers are not very sensitive to hyperparameters (train_batch_size and train_M). If you want to experiment with fewer resources and check reproducibility, here are our results by number of 32GB GPUs.

DPR with BERT-base:

| Num. of 32GB GPU(s) | (train_batch_size, train_M) | NQ-open (dev) | NQ-open (test) |
|---|---|---|---|
| 1 | (8, 16) | 40.5 | 41.4 |
| 2 | (16, 16) | 40.9 | 41.1 |
| 4 | (16, 32) | 41.2 | 41.1 |
| 8 | (32, 32) | 40.6 | 41.6 |

DPR with BERT-large:

| Num. of 32GB GPU(s) | (train_batch_size, train_M) | NQ-open (dev) | NQ-open (test) |
|---|---|---|---|
| 2 | (8, 8) | 42.0 | 43.4 |
| 4 | (8, 16) | 43.2 | 44.3 |
| 8 | (16, 16) | 42.2 | 43.2 |

SpanSeqGen with BART-large:

| Num. of 12GB GPU(s) | (train_batch_size, max_input_len) | NQ-open EM (dev) | AmbigNQ F1 (dev) |
|---|---|---|---|
| 1 | (2, 700) | 37.81 | 39.38 |

Need preprocessed data / pretrained models / predictions?


Question Answering Click in order to download checkpoints:

Passage Reranking from DPR Reader

For a sanity check, the recall accuracy should be as follows. (For AmbigQA, macro-average of recall.)

| k | NQ train | NQ dev | NQ test | AmbigQA train | AmbigQA dev |
|---|---|---|---|---|---|
| 1 | 80.4 | 59.8 | 59.4 | 58.3 | 51.8 |
| 5 | 86.8 | 75.9 | 76.3 | 72.7 | 70.0 |
| 10 | 87.8 | 79.9 | 80.8 | 76.2 | 74.8 |
| 100 | 89.2 | 86.2 | 87.4 | 81.2 | 83.1 |
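
One way to read "macro-average of recall" for AmbigQA is sketched below: for each question, take the fraction of its gold answers found in the top-k passages, then average those fractions over questions. This reading, the function name, and the substring matching are assumptions for illustration and may not match how the numbers above were produced.

```python
from typing import List


def macro_recall_at_k(retrieved: List[List[str]], answers: List[List[str]], k: int) -> float:
    """Average over questions of the share of gold answers found in the top-k passages."""
    per_question = []
    for passages, golds in zip(retrieved, answers):
        if not golds:
            continue  # skip questions without gold answers
        text = " ".join(passages[:k]).lower()
        found = sum(1 for g in golds if g.lower() in text)
        per_question.append(found / len(golds))
    return sum(per_question) / len(per_question) if per_question else 0.0


# Toy data, made up for illustration only.
retrieved = [
    ["The first film was released in 1994.", "A remake followed in 1998."],
    ["Paris is the capital of France.", "Lyon is a city in France."],
]
answers = [["1994", "1998"], ["Paris"]]

print(macro_recall_at_k(retrieved, answers, k=1))  # 0.75
print(macro_recall_at_k(retrieved, answers, k=2))  # 1.0
```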

Question Disambiguation Coming soon!

Usage examples

Here are some examples for running these models:

Train only SpanSeqGen

In the run below, we use checkpoints for DPR Retrieval and DPR Reader to train SpanSeqGen on NQ-open, then fine-tune the trained model on AmbigQA. This run has been tested successfully on a single Azure NC12 machine (24 GiB).

$ conda create --name ambigqa python=3.6.12
$ conda activate ambigqa

# Install libraries
$ pip install torch==1.1.0
$ pip install git+
$ pip install wget

# Clone git repository
$ git clone
$ cd AmbigQA/codes

# Download data
$ mkdir dpr_data_dir
$ python3 --resource data.wikipedia_split.psgs_w100 --output_dir ./dpr_data_dir
$ python3 --resource data.wikipedia_split.psgs_w100_20200201 --output_dir ./dpr_data_dir
$ python3 --resource checkpoint.retriever.multiset.bert-base-encoder --output_dir ./dpr_data_dir
$ python3 --resource data.nqopen --output_dir ./data
$ python3 --resource data.gold_passages_info.nq_train --output_dir ./data
$ python3 --resource data.ambigqa --output_dir ./data

# Download checkpoint for DPR predictions on NQ
$ mkdir out
$ mkdir out/dpr
$ wget
$ unzip
$ mv nq-dpr/* out/dpr/
$ rm -r nq-dpr

# Download Reranking result (37M)
$ mkdir out/nq-span-selection
$ wget
$ unzip
$ mv reranking_results/nq_dev.json out/nq-span-selection/dev_psg_sel.json
$ mv reranking_results/nq_train.json out/nq-span-selection/train_for_inference_psg_sel.json
$ mv reranking_results/nq_test.json out/nq-span-selection/test_psg_sel.json

# Download DPR Reader trained on NQ (387M)
$ wget
$ unzip
$ mv nq-bert-base-uncased-32-32-0/* out/nq-span-selection/
$ rm -r nq-bert-base-uncased-32-32-0

$ rm *.zip

# Train SpanSeqGen
$ conda activate ambigqa
$ python3 --do_train --task qa --output_dir out/nq-span-seq-gen \
    --dpr_data_dir ./dpr_data_dir \
    --train_file ./data/nqopen/train.json \
    --predict_file ./data/nqopen/dev.json \
    --psg_sel_dir ./out/nq-span-selection \
    --bert_name bart-large \
    --discard_not_found_answers \
    --train_batch_size 2 --predict_batch_size 2 \
    --eval_period 2000 --wait_step 10 --max_input_length 700

# Fine tune on AmbigQA
$ mv reranking_results/ambigqa_dev_2020.json out/nq-span-selection/dev_20200201_psg_sel.json
$ mv reranking_results/ambigqa_train_2020.json out/nq-span-selection/train_for_inference_20200201_psg_sel.json
$ python3 --do_train --task qa --output_dir out/ambignq-span-seq-gen \
    --dpr_data_dir dpr_data_dir \
    --train_file data/ambigqa/train_light.json \
    --predict_file data/ambigqa/dev_light.json \
    --psg_sel_dir out/nq-span-selection \
    --bert_name bart-large \
    --discard_not_found_answers \
    --train_batch_size 2 --predict_batch_size 2 \
    --eval_period 500 --wait_step 10 --ambigqa --wiki_2020 --max_answer_length 25

# Explore hyperparameter impact on inference time
$ ./


