The full list of datasets with sources, citations, and licenses:
Task | Dataset | Source Link | Citation | License
---|---|---|---|---
Entity Matching | DeepMatcher | [Link] | [1] | BSD 3-Clause
Error Detection | Raha | [Link] | [2] | Apache 2.0
TextCLS | AG News | [Link] | [3] | http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
TextCLS | Amazon | [Link] | [4] | BSD 3-Clause
TextCLS | ATIS | [Link] | | MIT
TextCLS | SNIPS | [Link] | [5] | Apache 2.0
TextCLS | SST | [Link] | [6] | GNU General Public License
TextCLS | TREC | [Link] | [7] |
TextCLS | IMDB | [Link] | [8] | https://www.imdb.com/conditions
By downloading these datasets you agree to the terms and conditions set by the original licenses listed above. The same licenses also apply to the derived datasets provided in this repo for your convenience.
- Entity Matching: Each dataset is a directory stored under `em/`. Each dataset comes with training, validation, and test sets called `train.txt`, `valid.txt`, and `test.txt` respectively. Both the clean and dirty versions of the datasets are obtained from DeepMatcher.
- Error Detection: Each dataset is stored in a directory under `cleaning/`. Each dataset comes in 4 sizes (50, 100, 150, or 200), and each size has 5 splits. For example, the 0-th split of the `beers` dataset of size 100 corresponds to the directory `cleaning/beers/100_10000/0/`. Each split contains a training set, a validation set, a test set, and an unlabeled set named `train.txt`, `valid.txt`, `test.txt`, and `unlabeled.txt` respectively.
- Text Classification: Each dataset is stored in a directory under `textcls/`. The training and validation sets come in sizes 100, 300, 500, and 1000 (e.g., `train.txt.300`). The training file `train.txt.full` is used for semi-supervised learning. There is a single test set named `test.txt`.
- For each training set, we pre-computed the invda augmentations in jsonlines files with the suffix `*.augment.jsonl`.
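The layout above can be sketched with a few path helpers. This is a minimal, illustrative sketch: the directory and file names follow the conventions described above, but the helper names (`em_files`, `cleaning_split_dir`, `textcls_train_file`, `read_jsonl`) and the dataset name `agnews` used below are hypothetical, not part of the repo.

```python
# Illustrative helpers for the dataset layout described above.
# Helper names are our own; only the path conventions come from the README.
import json
from pathlib import Path

def em_files(root, dataset):
    """Train/valid/test files of an entity-matching dataset under em/."""
    base = Path(root) / "em" / dataset
    return {split: base / f"{split}.txt" for split in ("train", "valid", "test")}

def cleaning_split_dir(root, dataset, size_dir, split):
    """One error-detection split, e.g. cleaning/beers/100_10000/0/."""
    return Path(root) / "cleaning" / dataset / size_dir / str(split)

def textcls_train_file(root, dataset, size="full"):
    """A text-classification training file, e.g. train.txt.300 or train.txt.full."""
    return Path(root) / "textcls" / dataset / f"train.txt.{size}"

def read_jsonl(text):
    """Parse jsonlines content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

For example, `textcls_train_file("data", "agnews", 300)` yields `data/textcls/agnews/train.txt.300`, and `read_jsonl` can be applied to the contents of a `*.augment.jsonl` file to recover one augmentation record per line.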