The full list of datasets with sources, citations, and licenses:
Task | Dataset | Source Link | Citation | License
---|---|---|---|---
Entity Matching | DeepMatcher | [Link] | [1] | BSD 3-Clause
Error Detection | Raha | [Link] | [2] | Apache 2.0
TextCLS | AG News | [Link] | [3] | http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
TextCLS | Amazon | [Link] | [4] | BSD 3-Clause
TextCLS | ATIS | [Link] | | MIT
TextCLS | SNIPS | [Link] | [5] | Apache 2.0
TextCLS | SST | [Link] | [6] | GNU General Public License
TextCLS | TREC | [Link] | [7] |
TextCLS | IMDB | [Link] | [8] | https://www.imdb.com/conditions
By downloading these datasets you agree to the terms and conditions set by the original licenses listed above. The same licenses also apply to the derived datasets provided in this repo for your convenience.
- Entity Matching: Each dataset is a directory stored under `em/`. Each dataset comes with training, validation, and test sets called `train.txt`, `valid.txt`, and `test.txt` respectively. Both the clean and dirty versions of the datasets are obtained from DeepMatcher.
- Error Detection: Each dataset is stored in a directory under `cleaning/`. Each dataset comes in 4 sizes (50, 100, 150, or 200), and each size has 5 splits. For example, the 0-th split of the `beers` dataset of size 100 corresponds to the directory `cleaning/beers/100_10000/0/`. Each split contains a training set, a validation set, a test set, and an unlabeled set named `train.txt`, `valid.txt`, `test.txt`, and `unlabeled.txt` respectively.
- Text Classification: Each dataset is stored in a directory under `textcls/`. The training and validation sets come in sizes 100, 300, 500, and 1000 (e.g., `train.txt.300`). The training file `train.txt.full` is used for semi-supervised learning. There is a single test set named `test.txt`.
- For each training set, we pre-computed the invda augmentations in jsonlines files with the suffix `*.augment.jsonl`.
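The layout above can be sketched with a few path helpers. This is a minimal, illustrative sketch: the directory and file names follow the conventions described above, but the helper names (`em_files`, `cleaning_split_dir`, `textcls_train_file`, `read_jsonl`) and the dataset name `agnews` used below are hypothetical, not part of the repo.

```python
# Illustrative helpers for the dataset layout described above.
# Helper names are our own; only the path conventions come from the README.
import json
from pathlib import Path

def em_files(root, dataset):
    """Train/valid/test files of an entity-matching dataset under em/."""
    base = Path(root) / "em" / dataset
    return {split: base / f"{split}.txt" for split in ("train", "valid", "test")}

def cleaning_split_dir(root, dataset, size_dir, split):
    """One error-detection split, e.g. cleaning/beers/100_10000/0/."""
    return Path(root) / "cleaning" / dataset / size_dir / str(split)

def textcls_train_file(root, dataset, size="full"):
    """A text-classification training file, e.g. train.txt.300 or train.txt.full."""
    return Path(root) / "textcls" / dataset / f"train.txt.{size}"

def read_jsonl(text):
    """Parse jsonlines content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

For example, `textcls_train_file("data", "agnews", 300)` yields `data/textcls/agnews/train.txt.300`, and `read_jsonl` can be applied to the contents of a `*.augment.jsonl` file to recover one augmentation record per line.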