Open Domain Question Answering over Tables via Dense Retrieval

This document provides the models and the steps needed to reproduce the results of Open Domain Question Answering over Tables via Dense Retrieval, published at NAACL 2021.

Retrieval Models

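| Size | Type | Hard Negatives | Down Project | Recall@1 | Recall@10 | Recall@50 | Link |
|------|------|----------------|--------------|----------|-----------|-----------|------|
| LARGE | Pretrained | No | No | | | | |
| LARGE | Pretrained | No | 256 | | | | |
| MEDIUM | Pretrained | No | 256 | | | | |
| SMALL | Pretrained | No | 256 | | | | |
| TINY | Pretrained | No | 256 | | | | |
| LARGE | Finetuned on NQ | No | 256 | 35.9 | 75.9 | 91.4 | |
| LARGE | Finetuned on NQ | Yes | 256 | 44.2 | 81.8 | 92.3 | |
| MEDIUM | Finetuned on NQ | No | 256 | 37.1 | 74.5 | 88.0 | |
| MEDIUM | Finetuned on NQ | Yes | 256 | 44.9 | 79.8 | 91.1 | |
| SMALL | Finetuned on NQ | No | 256 | 37.6 | 72.8 | 87.4 | |
| SMALL | Finetuned on NQ | Yes | 256 | 41.8 | 77.1 | 89.9 | |
| TINY | Finetuned on NQ | No | 256 | 17.3 | 54.1 | 76.3 | |
| TINY | Finetuned on NQ | Yes | 256 | 22.2 | 61.3 | 78.9 | |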

Reader models

Size Hard Negatives Link

Load the released data directly

mkdir -p "${nq_data_dir}"
gsutil -m cp -R "gs://tapas_models/2021_07_22/nq_tables/*" "${nq_data_dir}"

Or generate the data

The following pipeline will generate the subset of Natural Questions where the answers are part of tables.


# Set GCP_PROJECT and GCP_BUCKET variables
gcloud config set project "${GCP_PROJECT}"
gcloud auth application-default login
python3 sdist
python3 tapas/scripts/ \
  --input_path="gs://natural_questions/v1.0" \
  --output_path="gs://${GCP_BUCKET}/nq_tables" \
  --runner_type="DATAFLOW" \
  --save_main_session \
  --gc_project="${GCP_PROJECT}" \
  --gc_region="us-west1" \
  --gc_job_name="create-intermediate" \
  --gc_staging_location="gs://${GCP_BUCKET}/staging" \
  --gc_temp_location="gs://${GCP_BUCKET}/tmp"
mkdir -p "${nq_data_dir}"
gsutil -m cp -R "gs://${GCP_BUCKET}/nq_tables/*" "${nq_data_dir}"

You can also run the pipeline locally, but that takes a long time and a lot of memory:

mkdir -p "${nq_data_dir}/raw"
gsutil -m cp -R "gs://natural_questions/v1.0/*" "${nq_data_dir}/raw"
python3 tapas/scripts/ \
  --input_path="gs://natural_questions/v1.0" \
  --output_path="${nq_data_dir}" \

Retrieval Flow

The full retrieval process consists of the following steps. Each step is described in detail below.

  1. Pre-train the model.
  2. Fine-tune the model.
  3. Select the best checkpoint with respect to a retrieval metric (e.g., eval_precision_at_1) in the local setting, which treats all tables that appear in the dev set as the corpus. These metrics are printed to XM.
  4. Produce global predictions for the selected best checkpoint. These consist of representations for all tables in the corpus.
  5. Generate retrieval metrics with respect to the global setting, and write the KNN table ids and scores for each query to a JSON file (to be used for negatives mining or E2E QA).
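
Step 5 mentions that the KNN results can be used for negatives mining. The following is a minimal sketch of one common way to do this, assuming hard negatives are simply the highest-ranked retrieved tables that are not the gold table. The function name `mine_hard_negatives` and the table ids are hypothetical, and this is not necessarily the exact procedure used by the scripts in this repository.

```python
def mine_hard_negatives(knn_table_ids, gold_table_id, num_negatives):
    """Pick the top-ranked retrieved tables that are not the gold table.

    knn_table_ids is assumed to be sorted from most to least similar.
    """
    negatives = [t for t in knn_table_ids if t != gold_table_id]
    return negatives[:num_negatives]


# Hypothetical KNN result for one query (ids are made up for illustration).
knn_table_ids = ["t7", "t2", "t9", "t4"]
hard_negatives = mine_hard_negatives(knn_table_ids, gold_table_id="t2", num_negatives=2)
print(hard_negatives)  # ['t7', 't9']
```

Tables that score high but are wrong tend to be more informative during training than randomly sampled ones, which is the usual motivation for this kind of mining.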

Fine-Tuning a retrieval model

Download a pretrained checkpoint:

gsutil cp "gs://tapas_models/2021_04_27/${retrieval_model_name}.zip" . && unzip "${retrieval_model_name}.zip"

Then create the data for the retrieval model:

python3 tapas/retrieval/ \
  --input_interactions_dir="${nq_data_dir}/interactions" \
  --input_tables_dir=${nq_data_dir}/tables \
  --output_dir="${nq_data_dir}/tf_examples" \
  --vocab_file="${retrieval_model_name}/vocab.txt" \
  --max_seq_length="${max_seq_length}" \
  --max_column_id="${max_seq_length}" \
  --max_row_id="${max_seq_length}" \

Then train a dual encoder model:

python3 tapas/experiments/ \
   --do_train \
   --use_tpu \
   --keep_checkpoint_max=40 \
   --model_dir="${model_dir}" \
   --input_file_train="${nq_data_dir}/tf_examples/train.tfrecord" \
   --bert_config_file="${retrieval_model_name}/bert_config.json" \
   --init_checkpoint="${retrieval_model_name}/model.ckpt" \
   --init_from_single_encoder=false \
   --down_projection_dim=256 \
   --num_train_examples=5120000 \
   --learning_rate=1.25e-5 \
   --train_batch_size=256 \
   --warmup_ratio=0.01 \
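
To get a feel for the training schedule implied by the flags above, here is a small back-of-the-envelope calculation. It assumes that one training step consumes one batch and that the warmup length is the warmup ratio times the total number of steps; both are assumptions about how the flags are interpreted, not something stated in this document.

```python
# Values copied from the training command above.
num_train_examples = 5_120_000
train_batch_size = 256
warmup_ratio = 0.01

# Assumption: one step consumes exactly one batch of examples.
num_train_steps = num_train_examples // train_batch_size
# Assumption: warmup covers this fraction of all training steps.
num_warmup_steps = int(num_train_steps * warmup_ratio)

print(num_train_steps)   # 20000
print(num_warmup_steps)  # 200
```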

It's recommended to start a separate eval job that continuously produces predictions for the checkpoints created by the training job. This job also writes JSON files with the computed metrics, which can be used for early stopping.

python3 tapas/experiments/ \
   --do_predict \
   --model_dir="${model_dir}" \
   --input_file_eval="${nq_data_dir}/tf_examples/dev.tfrecord" \
   --bert_config_file="${retrieval_model_name}/bert_config.json" \
   --init_from_single_encoder=false \
   --down_projection_dim=256 \
   --eval_batch_size=32 \
   --num_train_examples=5120000 \
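
The early stopping mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the per-checkpoint metric values are made up, the metrics are assumed to be already loaded into a dictionary keyed by training step, and the helper name `best_step_with_patience` and its `patience` parameter are hypothetical.

```python
def best_step_with_patience(metrics_by_step, patience=3):
    """Return (best_step, stop_step).

    stop_step is the step at which the metric has failed to improve for
    `patience` consecutive evaluations, or None if that never happens.
    """
    best_step, best_value = None, float("-inf")
    evals_without_gain = 0
    for step in sorted(metrics_by_step):
        value = metrics_by_step[step]
        if value > best_value:
            best_step, best_value = step, value
            evals_without_gain = 0
        else:
            evals_without_gain += 1
            if evals_without_gain >= patience:
                return best_step, step
    return best_step, None


# Made-up eval_precision_at_1 values per checkpoint step.
history = {1000: 0.21, 2000: 0.30, 3000: 0.34, 4000: 0.33,
           5000: 0.36, 6000: 0.35, 7000: 0.35, 8000: 0.34}
print(best_step_with_patience(history, patience=3))  # (5000, 8000)
```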

Predict using the best dev checkpoint

Once training is done, we use the best checkpoint to generate embeddings for all the tables and for all of the training data queries. These embeddings are needed to train the reader model and to run a realistic evaluation that uses all table candidates.

# Note: precision_at_1 below actually represents recall@1.
for mode in train tables test; do
  python3 tapas/experiments/ \
     --do_predict \
     --model_dir="${model_dir}" \
     --prediction_output_dir="${model_dir}/${mode}" \
     --evaluated_checkpoint_metric=precision_at_1 \
     --input_file_predict="${nq_data_dir}/tf_examples/${mode}.tfrecord" \
     --bert_config_file="${retrieval_model_name}/bert_config.json" \
     --init_from_single_encoder=false \
     --down_projection_dim=256 \
     --eval_batch_size=32
done
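
Once query and table embeddings exist, retrieval reduces to a nearest-neighbor search over the table vectors. The sketch below assumes inner-product similarity and uses tiny made-up 3-dimensional vectors in place of the 256-dimensional projected embeddings. The names `dot` and `top_k_tables` are hypothetical helpers, not functions from this repository.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_tables(query_vec, table_vecs, k):
    """Return the k (table_id, score) pairs with the highest inner product."""
    scored = ((table_id, dot(query_vec, vec)) for table_id, vec in table_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up embeddings for illustration only.
table_vecs = {
    "table_a": [0.9, 0.1, 0.0],
    "table_b": [0.2, 0.8, 0.1],
    "table_c": [0.1, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.5]

neighbors = top_k_tables(query_vec, table_vecs, k=2)
print([table_id for table_id, _ in neighbors])  # ['table_a', 'table_c']
```

An exhaustive scan like this is fine for a toy corpus. For a corpus with many tables, an approximate nearest-neighbor index is the usual choice.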

Generate Retrieval Results

Run the evaluation to print recall@k scores in the global setting for the best model (e.g., the 5K checkpoint in this case). The evaluation also writes the k nearest-neighbor tables for each query, together with their similarity scores, to a JSONL file.

  • Set prediction_files_local to the best model output. This file holds the query ids, their representations, and the ids for the gold table.
  • Set prediction_files_global to the output path of the last step.
step=<SET_STEPS>  # Set this value according to the best dev results. The train and tables predictions generated in the previous step will only exist for this step.

# Computes train results
python tapas/scripts/ \
 --prediction_files_local=${model_dir}/train/predict_results_${step}.tsv \
 --prediction_files_global=${model_dir}/tables/predict_results_${step}.tsv \

# Computes test results
python tapas/scripts/ \
 --prediction_files_local=${model_dir}/test/predict_results_${step}.tsv \
 --prediction_files_global=${model_dir}/tables/predict_results_${step}.tsv \

# Computes dev results
python tapas/scripts/ \
 --prediction_files_local=${model_dir}/eval_results_${step}.tsv \
 --prediction_files_global=${model_dir}/tables/predict_results_${step}.tsv \
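
For reference, the recall@k metric reported here can be sketched as below. The sketch assumes each query has exactly one gold table and that retrieved tables are already sorted from most to least similar. The function name and the toy data are hypothetical.

```python
def recall_at_k(ranked_tables_by_query, gold_table_by_query, k):
    """Fraction of queries whose gold table appears among the top k retrieved tables."""
    hits = sum(
        1
        for query_id, gold in gold_table_by_query.items()
        if gold in ranked_tables_by_query.get(query_id, [])[:k]
    )
    return hits / len(gold_table_by_query)


# Toy rankings: the gold table sits at rank 1, 2, 3, or is missing.
ranked = {
    "q1": ["t1", "t5", "t6"],
    "q2": ["t5", "t2", "t6"],
    "q3": ["t5", "t6", "t3"],
    "q4": ["t5", "t6", "t7"],
}
gold = {"q1": "t1", "q2": "t2", "q3": "t3", "q4": "t4"}

for k in (1, 2, 3):
    print(k, recall_at_k(ranked, gold, k))
# 1 0.25
# 2 0.5
# 3 0.75
```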

Create training data for reader model

First, create the training data:

python3 tapas/retrieval/ \
  --input_dir="${nq_data_dir}/interactions" \
  --table_file="${nq_data_dir}/tables/tables.tfrecord" \
  --index_files_pattern="${FLAGS_model_dir}/*_knn.jsonl" \

gsutil cp "gs://tapas_models/2020_08_05/${reader_model_name}.zip" . && unzip "${reader_model_name}.zip"
python3 tapas/ \
  --task="NQ_RETRIEVAL" \
  --verbosity=-1 \
  --input_dir="${nq_data_dir}/e2e" \
  --output_dir="${nq_data_dir}/e2e" \
  --bert_vocab_file="${reader_model_name}/vocab.txt" \
  --mode="create_data" \
  --use_document_title \
  --update_answer_coordinates \

Fine-tune reader model

python3 tapas/ \
  --task="NQ_RETRIEVAL" \
  --output_dir="${nq_data_dir}/e2e" \
  --model_dir="${model_dir}" \
  --init_checkpoint="${reader_model_name}/model.ckpt" \
  --bert_config_file="${reader_model_name}/bert_config.json" \
  --mode="train" \

This will use the preset hyper-parameters set in

It's recommended to start a separate eval job to continuously produce predictions for the checkpoints created by the training job. Alternatively, you can run the eval job after training to only get the final results.

python3 tapas/ \
  --task="NQ_RETRIEVAL" \
  --output_dir="${nq_data_dir}/e2e" \
  --model_dir="${model_dir}" \
  --init_checkpoint="${reader_model_name}/model.ckpt" \
  --bert_config_file="${reader_model_name}/bert_config.json" \
  --bert_vocab_file="${reader_model_name}/vocab.txt" \


This code and data derived from Natural Questions are licensed under the Apache License, Version 2.0. The pretraining data is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
See also the Wikipedia Copyrights page.

How to cite this data and code?

You can cite the paper and the released data, to appear at NAACL 2021.