This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

resources for the IBM Airlines Table-Question-Answering Benchmark

IBM/AITQA

AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry

Abstract

Recent advances in transformers have enabled Table Question Answering (Table QA) systems to achieve high accuracy and SOTA results on open-domain datasets like WikiTableQuestions and WikiSQL. Such transformers are frequently pre-trained on open-domain content such as Wikipedia, where they effectively encode questions and corresponding tables from Wikipedia as seen in Table QA datasets. However, web tables in Wikipedia are notably flat in their layout, with the first row as the sole column header. This layout lends itself to a relational view of tables where each row is a tuple. In contrast, tables in domain-specific business or scientific documents often have a much more complex layout, including hierarchical row and column headers, and use specialized vocabulary from that domain. To address this problem, we introduce the domain-specific Table QA dataset AITQA (Airline Industry Table QA). The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings (SEC filings publicly available at: https://www.sec.gov/edgar.shtml) of major airline companies for the fiscal years 2017-2019. We also provide annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms. Our zero-shot baseline evaluation of three transformer-based SOTA Table QA methods - TaPAS (end-to-end), TaBERT (semantic parsing-based), and RCI (row-column encoding-based) - clearly exposes the limitations of these methods in this practical setting, with the best accuracy at just 51.8% (RCI). We also present pragmatic table pre-processing steps used to pivot and project these complex tables into a layout suitable for the SOTA Table QA models.

Dataset

This repository contains the data corresponding to the paper. If you use the data, please consider citing the paper draft as follows:

@misc{katsis2021aitqa,
      title={AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry}, 
      author={Yannis Katsis and Saneem Chemmengath and Vishwajeet Kumar and Samarth Bharadwaj and Mustafa Canim and Michael Glass and Alfio Gliozzo and Feifei Pan and Jaydeep Sen and Karthik Sankaranarayanan and Soumen Chakrabarti},
      year={2021},
      eprint={2106.12944},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Detailed information about this dataset and initial experiments can be found here: https://arxiv.org/abs/2106.12944

Content and format

Inside the raw_data folder you will find aitqa_questions.jsonl and aitqa_tables.jsonl. Both files can be read line by line, where each line is a serialized JSON object. The test questions and tables will be released upon acceptance of the dataset paper in submission (details above).
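
As a minimal sketch of the line-by-line reading described above, the Python helper below parses one JSON object per line. The function name `read_jsonl` is illustrative and not part of this repository, and skipping blank lines is a defensive assumption rather than something the files are known to need.

```python
import json


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, one per non-blank line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # assumption: tolerate stray blank lines
                records.append(json.loads(line))
    return records
```

Assuming the raw_data folder sits in the working directory, `read_jsonl("raw_data/aitqa_questions.jsonl")` and `read_jsonl("raw_data/aitqa_tables.jsonl")` would each return a list of dicts in the formats described in the next sections.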

Question, answers, table_id, and other annotations

The instances in aitqa_questions.jsonl look like the following:

{
  "id": "q-0",
  "table_id": "tab-0",
  "question": "How much money did United spend for aircraft fuel in 2016?",
  "answers": ["$5,813"],
  "type": "KPI-driven",
  "row_hierarchy_needed": "No",
  "paraphrase_group": "para-5"
}

The fields represent the following:

  • id: The ID of this data instance.
  • table_id: The ID of the table to which the question is addressed.
  • question: The question string.
  • answers: A list containing answer(s) to the question.
  • type: The type of information the question is about. It is either KPI-driven (questions that inquire about Key Performance Indicators (KPIs) in the airline industry) or Table-driven (questions on other concepts).
  • row_hierarchy_needed: Whether finding the answer relies on the row header hierarchy.
  • paraphrase_group: The ID of the paraphrase group. A paraphrase group is a set of instances whose questions are paraphrases of each other. The value is '' when the question does not have any paraphrase.
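
The annotation fields above lend themselves to simple grouping and filtering. The following Python sketch is one possible way to do that; the helper names `group_ids_by` and `select` are hypothetical, and the sample records are placeholders written for illustration, not entries from the dataset. In particular, "Yes" is only an assumed counterpart to the "No" value shown in the example above.

```python
from collections import defaultdict


def group_ids_by(questions, field):
    """Map each non-empty value of `field` to the IDs of the questions carrying it."""
    groups = defaultdict(list)
    for q in questions:
        value = q.get(field, "")
        if value:  # '' in paraphrase_group marks a question without paraphrases
            groups[value].append(q["id"])
    return dict(groups)


def select(questions, **criteria):
    """Keep questions whose annotations equal every given field=value pair."""
    return [q for q in questions if all(q.get(k) == v for k, v in criteria.items())]


# Hypothetical records in the documented shape; IDs and values are placeholders,
# and "Yes" is an assumed value for row_hierarchy_needed.
sample = [
    {"id": "q-a", "table_id": "tab-a", "type": "KPI-driven",
     "row_hierarchy_needed": "No", "paraphrase_group": "para-a"},
    {"id": "q-b", "table_id": "tab-a", "type": "KPI-driven",
     "row_hierarchy_needed": "No", "paraphrase_group": "para-a"},
    {"id": "q-c", "table_id": "tab-b", "type": "Table-driven",
     "row_hierarchy_needed": "Yes", "paraphrase_group": ""},
]

print(group_ids_by(sample, "paraphrase_group"))  # {'para-a': ['q-a', 'q-b']}
print(group_ids_by(sample, "table_id"))          # {'tab-a': ['q-a', 'q-b'], 'tab-b': ['q-c']}
print([q["id"] for q in select(sample, type="Table-driven")])  # ['q-c']
```

Grouping by paraphrase_group can be useful when checking whether a model answers all paraphrases of a question consistently, and grouping by table_id collects the questions addressed to each table.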

Tables

The instances in aitqa_tables.jsonl look like the following:

{  
  "id": "tab-5",
  "column_header": [
    ["At December 31,", "2018"],
    ["At December 31,", "2017 (a)"]
  ],
  "row_header": [
    ["Current assets:", "Cash and cash equivalents"],
    ["Current assets:", "Short-term investments"],
    ["Current assets:", "Receivables, less allowance for doubtful accounts 2018—$8; 2017—$7)"],
    ["Current assets:", "Aircraft fuel, spare parts and supplies, less obsolescence allowance (2018—$412; 2017—$354)"],
    ["Current assets:", "Prepaid expenses and other"],
    ["Current assets:", "Total current assets"],
    ["Owned-", "Operating property and equipment:", "Flight equipment"],
    ["Owned-", "Operating property and equipment:", "Other property and equipment"],
    ["Owned-", "Operating property and equipment:", "Total owned property and equipment"],
    ["Owned-", "Operating property and equipment:", "Less-Accumulated depreciation and amortization"],
    ["Owned-", "Operating property and equipment:", "Total owned property and equipment, net"],
    ["Owned-", "Operating property and equipment:", "Purchase deposits for flight equipment"],
    ["Capital leases-", "Flight equipment"],
    ["Capital leases-", "Other property and equipment"],
    ["Capital leases-", "Total capital leases"],
    ["Capital leases-", "Less-Accumulated amortization"],
    ["Capital leases-", "Total capital leases, net"],
    ["Capital leases-", "Total operating property and equipment, net"],
    ["Other assets:", "Goodwill"],
    ["Other assets:", "Intangibles, less accumulated amortization (2018-$1,380; 2017-$1,313)"],
    ["Other assets:", "Restricted cash"],
    ["Other assets:", "Notes receivable, net"],
    ["Other assets:", "Investments in affiliates and other, net"],
    ["Other assets:", "Total other assets"],
    ["Total assets"]
  ],
  "data": [
    ["$1,694", "$1,482"],
    ["2,256", "2,316"],
    ["1,346", "1,340"],
    ["985", "924"],
    ["913", "1,071"],
    ["7,194", "7,133"],
    ["31,607", "28,692"],
    ["7,919", "6,946"],
    ["39,526", "35,638"],
    ["(12,760)", "(11,159)"],
    ["26,766", "24,479"],
    ["1,177", "1,344"],
    ["1,029", "1,151"],
    ["11", "11"],
    ["1,040", "1,162"],
    ["(654)", "(777)"],
    ["386", "385"],
    ["28,329", "26,208"],
    ["4,523", "4,523"],
    ["3,159", "3,539"],
    ["105", "91"],
    ["516", "46"],
    ["966", "806"],
    ["9,269", "9,005"],
    ["$44,792", "$42,346"]
  ]
}

The fields represent the following:

  • id: The Table ID.
  • column_header: A list of column headers in the table. Column headers can be hierarchical; each sublist lists one column's header levels in order of the hierarchy.
  • row_header: A list of row headers in the table. Row headers can be hierarchical; each sublist lists one row's header levels in order of the hierarchy.
  • data: A list of rows. Each row is a list of row entries.
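
To illustrate how the hierarchical headers address a cell, here is a small Python sketch. It is not the pre-processing described in the paper; the function names and the " | " separator are assumptions made for this example. It also assumes that data[i] lines up with row_header[i] and data[i][j] with column_header[j], which matches the tab-5 example above. The `excerpt` table reuses three rows of tab-5.

```python
def flatten_headers(header_paths, sep=" | "):
    """Join each hierarchical header path into a single flat label."""
    return [sep.join(path) for path in header_paths]


def to_flat_rows(table, sep=" | "):
    """Return (header_row, body_rows) with one flat label per column and per row."""
    columns = flatten_headers(table["column_header"], sep)
    rows = flatten_headers(table["row_header"], sep)
    if len(rows) != len(table["data"]):
        raise ValueError("row_header and data have different lengths")
    header = [""] + columns
    body = [[label] + list(cells) for label, cells in zip(rows, table["data"])]
    return header, body


def lookup(table, row_path, column_path):
    """Fetch the cell addressed by a full row-header path and column-header path."""
    i = table["row_header"].index(list(row_path))
    j = table["column_header"].index(list(column_path))
    return table["data"][i][j]


# Three rows copied from the tab-5 example above.
excerpt = {
    "id": "tab-5",
    "column_header": [["At December 31,", "2018"], ["At December 31,", "2017 (a)"]],
    "row_header": [
        ["Current assets:", "Cash and cash equivalents"],
        ["Owned-", "Operating property and equipment:", "Flight equipment"],
        ["Capital leases-", "Flight equipment"],
    ],
    "data": [["$1,694", "$1,482"], ["31,607", "28,692"], ["1,029", "1,151"]],
}

year_2018 = ["At December 31,", "2018"]
owned = ["Owned-", "Operating property and equipment:", "Flight equipment"]
leased = ["Capital leases-", "Flight equipment"]
print(lookup(excerpt, owned, year_2018))   # 31,607
print(lookup(excerpt, leased, year_2018))  # 1,029
```

The two lookups show why the hierarchy matters: the leaf label "Flight equipment" appears under both "Owned-" and "Capital leases-", so only the full header path identifies the row unambiguously.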

Notes

Any changes or improvements to the dataset are logged in CHANGELOG.md.

If you have any questions, please use the discussion feature rather than issues. Create a new [issue here][issues] to flag dataset inconsistencies or improvements, generally after a discussion.

We accept contributions to the repo as per CONTRIBUTING.md.