# Intermediate Pre-Training Data

In our latest models, two training objectives are used after the standard masked language modeling and before fine-tuning, hence the name intermediate pre-training. Both objectives are binary classification tasks on a sentence-table pair, where the tables are real Wikipedia tables based on the ones released here. To explain each of the two tasks, let us use the following example table:

| Rank | Player | Country | Earnings | Events | Wins |
| --- | --- | --- | --- | --- | --- |
| 1 | Greg Norman | Australia | 1,654,959 | 16 | 3 |
| 2 | Billy Mayfair | United States | 1,543,192 | 28 | 2 |
| 3 | Lee Janzen | United States | 1,378,966 | 28 | 3 |
| 4 | Corey Pavin | United States | 1,340,079 | 22 | 2 |
| 5 | Steve Elkington | Australia | 1,254,352 | 21 | 2 |

## Synthetic Examples

We generate synthetic examples using a simple grammar. The grammar randomly generates two SQL-like phrases, each of which may use aggregations or constants, and compares one against the other using equality or numeric comparisons. We assign a binary label according to the truth value of the generated statement, and the algorithm is adjusted so that positives and negatives occur in equal numbers. In total we obtain 3.7 million examples.

Below are some examples; we recommend reading section 3.2 in the paper for the full details.

1. *2 is less than wins when Player is Lee Janzen.* The right-hand side corresponds to the query

   ```sql
   SELECT wins FROM table WHERE player = "Lee Janzen"
   ```

2. *The sum of Earnings when Country is Australia is 2,909,311.* The left-hand side corresponds to the query

   ```sql
   SELECT SUM(earnings) FROM table WHERE country = "Australia"
   ```

   For the table above, this adds up the earnings of the two Australian players: 1,654,959 + 1,254,352 = 2,909,311, so the statement is true.

Although the language is artificial, these examples can improve the model's numerical reasoning skills.
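Since the released generation code is not yet available (see Code Release below), here is a minimal illustrative sketch in Python of the grammar idea. The table encoding, the set of aggregations, and the label balancing heuristic are simplified assumptions, not the paper's exact procedure:

```python
import random

# The example table from above, with numeric cells as Python numbers.
TABLE = [
    {"player": "Greg Norman",     "country": "Australia",     "earnings": 1654959, "events": 16, "wins": 3},
    {"player": "Billy Mayfair",   "country": "United States", "earnings": 1543192, "events": 28, "wins": 2},
    {"player": "Lee Janzen",      "country": "United States", "earnings": 1378966, "events": 28, "wins": 3},
    {"player": "Corey Pavin",     "country": "United States", "earnings": 1340079, "events": 22, "wins": 2},
    {"player": "Steve Elkington", "country": "Australia",     "earnings": 1254352, "events": 21, "wins": 2},
]

NUMERIC_COLUMNS = ["earnings", "events", "wins"]
AGGREGATIONS = {"sum": sum, "maximum": max, "minimum": min}
COMPARISONS = {"is": lambda a, b: a == b,
               "is less than": lambda a, b: a < b,
               "is greater than": lambda a, b: a > b}

def random_aggregation_phrase(table):
    """One SQL-like phrase: an aggregation of a numeric column under a filter."""
    column = random.choice(NUMERIC_COLUMNS)
    filter_column = random.choice(["player", "country"])
    filter_value = random.choice([row[filter_column] for row in table])
    values = [row[column] for row in table if row[filter_column] == filter_value]
    name, aggregate = random.choice(list(AGGREGATIONS.items()))
    text = f"the {name} of {column} when {filter_column} is {filter_value}"
    return text, aggregate(values)

def generate_example(table):
    """Compare an aggregation phrase against a constant; the label is the
    truth value of the resulting statement."""
    phrase, value = random_aggregation_phrase(table)
    comparison, compare = random.choice(list(COMPARISONS.items()))
    # Half the time use the true value, otherwise perturb it in a random
    # direction; one crude way to keep the labels roughly balanced.
    constant = value if random.random() < 0.5 else value + random.choice([-1, 1]) * random.randint(1, 5)
    statement = f"{phrase} {comparison} {constant:,}"
    return statement, compare(value, constant)

statement, label = generate_example(TABLE)
print(statement, label)
# e.g. "the sum of earnings when country is Australia is 2,909,311" True
```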

## Counterfactual Examples

Counterfactual examples are created by randomly replacing an entity in a statement with a different but plausible entity. The original sentences are also obtained from Wikipedia, by selecting text that appears in the vicinity of the table or that links to the table from a different page. For example, if we get a text that reads

*Greg Norman has the highest earnings*

we replace "Greg Norman" for another entity appearing in the same column of the table, obtaining

*~~Greg Norman~~ Steve Elkington has the highest earnings.*

In total we obtain 4.1 million examples, and the model is asked to detect whether the sentence has been corrupted. The sentences obtained in this manner sound more natural, but the type of logical inference that the model has to perform on the table is typically simpler, since looking at a single row is often enough. See section 3.1 in the paper for the full details.
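A minimal sketch of the replacement step, assuming we already know which table column the entity comes from (sentence harvesting and entity linking are the hard parts and are omitted here):

```python
import random

# The "Player" column of the example table above.
PLAYER_COLUMN = ["Greg Norman", "Billy Mayfair", "Lee Janzen",
                 "Corey Pavin", "Steve Elkington"]

def corrupt(statement, column_values):
    """Swap the first entity from `column_values` found in `statement`
    with a different, plausible entity from the same column."""
    for entity in column_values:
        if entity in statement:
            replacement = random.choice([v for v in column_values if v != entity])
            return statement.replace(entity, replacement)
    return None  # No entity from this column occurs in the statement.

print(corrupt("Greg Norman has the highest earnings", PLAYER_COLUMN))
# e.g. "Steve Elkington has the highest earnings" -> label: corrupted
```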

## Model Release

The models trained jointly on both datasets are released with and without fine-tuning on the end tasks (TabFact, SQA, etc.). All the links, together with the expected results, are listed in the main readme.
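For instance, a TabFact-style entailment model can be queried through the Hugging Face `transformers` port of TAPAS. This is an assumption for illustration (the checkpoints linked in the main readme are the TensorFlow originals, and the checkpoint name below should be verified on the hub):

```python
import pandas as pd
import torch
from transformers import TapasForSequenceClassification, TapasTokenizer

# Assumed TabFact checkpoint name on the Hugging Face hub.
name = "google/tapas-base-finetuned-tabfact"
tokenizer = TapasTokenizer.from_pretrained(name)
model = TapasForSequenceClassification.from_pretrained(name)

# TapasTokenizer expects every cell as a string.
table = pd.DataFrame({
    "Player":   ["Greg Norman", "Steve Elkington"],
    "Earnings": ["1,654,959", "1,254,352"],
})
inputs = tokenizer(table=table,
                   queries=["Steve Elkington has the highest earnings"],
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Index 1 should be the "entailed" class; verify via model.config.id2label.
print("entailed" if logits.argmax(-1).item() == 1 else "refuted")
```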

## Code Release

Data generation code is coming soon.

## Licence

This code and data are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
See also the Wikipedia Copyrights page.

## How to cite this data and code?

You can cite the paper to appear in EMNLP 2020 Findings.