Datasets used in Rotom

The full list of datasets with sources, citations, and licenses:

| Dataset | Source | Link | Citation | License |
|---|---|---|---|---|
| Entity Matching | DeepMatcher | [Link] | [1] | BSD 3-clause |
| Error Detection | Raha | [Link] | [2] | Apache 2.0 |
| TextCLS | AG News | [Link] | [3] | https://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html |
| TextCLS | Amazon | [Link] | [4] | BSD 3-Clause |
| TextCLS | ATIS | [Link] | | MIT |
| TextCLS | SNIPS | [Link] | [5] | Apache 2.0 |
| TextCLS | SST | [Link] | [6] | GNU General Public License |
| TextCLS | TREC | [Link] | [7] | |
| TextCLS | IMDB | [Link] | [8] | https://www.imdb.com/conditions |

By downloading these datasets, you agree to the terms and conditions set by the original licenses listed above. The same licenses also apply to the derived datasets provided in this repo for your convenience.

Notes

  • Entity Matching: Each dataset is a directory stored under em/. Each dataset comes with training, validation, and test sets named train.txt, valid.txt, and test.txt, respectively. Both the clean and dirty versions of the datasets are obtained from DeepMatcher.
  • Error Detection: Each dataset is stored in a directory under cleaning/. Each dataset comes in 4 sizes: [50, 100, 150, 200], and each size has 5 splits. For example, the 0-th split of the beers dataset of size 100 corresponds to the directory cleaning/beers/100_10000/0/. Each split contains a training set, a validation set, a test set, and an unlabeled set, named train.txt, valid.txt, test.txt, and unlabeled.txt, respectively.
  • Text Classification: Each dataset is stored in a directory under textcls/. The training and validation sets come in sizes 100, 300, 500, and 1000 (e.g., train.txt.300). The training file train.txt.full is used for semi-supervised learning. There is a single test set named test.txt.
  • For each training set, the invda augmentations are pre-computed and stored in jsonlines files with the suffix *.augment.jsonl.
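
The path conventions in the notes above can be sketched as small helper functions. This is an illustrative sketch, not part of the repo: the function names, the `pool` parameter (the `10000` component seen in the example path), and the dataset name `ag_news` used in the comments are hypothetical, and the jsonlines format is assumed to be one JSON object per line.

```python
import json
import os

def cleaning_split_dir(root, dataset, size, split, pool=10000):
    # Convention from the notes: cleaning/<dataset>/<size>_<pool>/<split>/
    # e.g. cleaning/beers/100_10000/0/ for the 0-th split of beers at size 100.
    # The `pool` parameter generalizes the 10000 component and is an assumption.
    return os.path.join(root, "cleaning", dataset, "%d_%d" % (size, pool), str(split))

def textcls_train_path(root, dataset, size=None):
    # train.txt.<size> for size in {100, 300, 500, 1000};
    # train.txt.full (size=None) is the file used for semi-supervised learning.
    suffix = "full" if size is None else str(size)
    return os.path.join(root, "textcls", dataset, "train.txt." + suffix)

def load_augmentations(train_path):
    # Pre-computed invda augmentations sit next to each training file,
    # in a jsonlines file with the *.augment.jsonl suffix.
    with open(train_path + ".augment.jsonl") as f:
        return [json.loads(line) for line in f]
```

For example, `textcls_train_path("data", "ag_news", 300)` yields `data/textcls/ag_news/train.txt.300` (on POSIX systems), matching the layout described above.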