Overview:
- Dataset summary
- PAN Closed-set setup
- PAN Open-set setup
- Dataset statistics
- Original dataset files
- Reddit datasets
If you use these dataset splits, please cite both papers:
@article{DBLP:journals/corr/abs-2112-05125,
author = {Andrei Manolache and
Florin Brad and
Elena Burceanu and
Antonio Barbalau and
Radu Tudor Ionescu and
Marius Popescu},
title = {Transferring BERT-like Transformers' Knowledge for Authorship Verification},
journal = {CoRR},
volume = {abs/2112.05125},
year = {2021},
url = {https://arxiv.org/abs/2112.05125},
eprinttype = {arXiv},
eprint = {2112.05125},
timestamp = {Mon, 13 Dec 2021 17:51:48 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2112-05125.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{Kestemont2020OverviewOT,
author = {Mike Kestemont and
Enrique Manjavacas and
Ilia Markov and
Janek Bevendorff and
Matti Wiegmann and
Efstathios Stamatatos and
Martin Potthast and
Benno Stein},
editor = {Linda Cappellato and
Carsten Eickhoff and
Nicola Ferro and
Aur{\'{e}}lie N{\'{e}}v{\'{e}}ol},
title = {Overview of the Cross-Domain Authorship Verification Task at {PAN}
2020},
booktitle = {Working Notes of {CLEF} 2020 - Conference and Labs of the Evaluation
Forum, Thessaloniki, Greece, September 22-25, 2020},
series = {{CEUR} Workshop Proceedings},
volume = {2696},
publisher = {CEUR-WS.org},
year = {2020},
url = {https://ceur-ws.org/Vol-2696/paper\_264.pdf},
timestamp = {Tue, 27 Oct 2020 17:12:48 +0100},
biburl = {https://dblp.org/rec/conf/clef/KestemontMMBWSP20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
dataset | split type | filename |
---|---|---|
PAN 2020 | closed-set v1 and v2 | pan2020_closed_set_splits.zip |
PAN 2020 | open-set unseen authors | pan2020_open_set_unseen_authors_splits.zip |
PAN 2020 | open-set unseen fandoms | pan2020_open_set_unseen_fandoms_splits.zip |
PAN 2020 | open-set unseen all | pan2020_open_set_unseen_all_splits.zip |
DarkReddit | open-set unseen authors | minidarkreddit_authorship_verification.zip |
In the closed-set setup, authors of same-author pairs in the validation/test set are guaranteed to appear in the training set. However, this is difficult to achieve for the different-author pairs of the PAN 2020 dataset, as they span a large number of authors with few occurrences each.
Download pan2020_closed_set_splits.zip and unzip it. This is the structure of its content:
xl/
v1_split/
pan20-av-large-test.jsonl
pan20-av-large-notest.jsonl
v2_split/
pan20-av-large-test.jsonl
pan20-av-large-notest.jsonl
xs/
v1_split/
pan20-av-small-test.jsonl
pan20-av-small-notest.jsonl
v2_split/
pan20-av-small-test.jsonl
pan20-av-small-notest.jsonl
We try two variants of splitting the datasets, called v1 and v2. The splits for the PAN 2020 large dataset can be found in the xl folder, while the splits for the PAN 2020 small dataset can be found in the xs folder.
In the v1 split, authors of same-author pairs in the validation set are guaranteed to appear in the training set, while some authors of different-author pairs in the validation set may not appear in the training set.
Here are some dataset statistics:
dataset | filename | size | SA / SA-SF / SA-DF | DA / DA-SF / DA-DF |
---|---|---|---|---|
PAN 2020 large - original | pan20-av-large.jsonl | 275565 | - | - |
PAN 2020 large - test | pan20-av-large-test.jsonl | 13784 | 7395/0/7395 | 6389/1114/5275 |
PAN 2020 large - w/o test | pan20-av-large-notest.jsonl | 261784 | - | - |
PAN 2020 large - train | pan20-av-large-train.jsonl | 248688 | 133359/0/133359 | 115329/20945/94384 |
PAN 2020 large - val | pan20-av-large-val.jsonl | 13090 | 7024/0/7024 | 6069/1072/4997 |
where:
- SA: same-author pairs
- SA-SF: same-author pairs that have the same fandom
- SA-DF: same-author pairs that have different fandoms
- DA: different-author pairs
- DA-SF: different-author pairs that have the same fandom
- DA-DF: different-author pairs that have different fandoms
To split the PAN 2020 large dataset (pan20-av-large-notest.jsonl) into train (pan20-av-large-train.jsonl) and validation (pan20-av-large-val.jsonl) splits using the v1 version, call the split_jsonl_dataset function in preprocess/split_train_val.py:
cd preprocess
python split_train_val.py
Make sure to specify the correct paths:
- path_to_train_jsonl is where you want to save your training split
- path_to_test_jsonl is where you want to save your validation split
# split Train dataset into Train and Val
split_jsonl_dataset(
    path_to_original_jsonl="pan20-av-large-notest.jsonl",
    path_to_train_jsonl="pan20-av-large-train.jsonl",
    path_to_test_jsonl="pan20-av-large-val.jsonl",
    split_function=split_pan_dataset_closed_set_v1,
    test_split_percentage=0.05
)
Different-author (DA) pairs:
- DA pairs are randomly assigned to the train/val splits
- unseen authors can therefore appear at evaluation time, e.g. (A1, A2) in the training set and (A3, A4) in the val set
Same-author (SA) pairs:
- while populating the validation split, SA pairs are evenly assigned to the train/val splits (sketched after this list)
- for instance, if we have 10 SA examples from a given author, we assign 5 examples to the training split and 5 examples to the validation split; this ensures that the author of every SA pair in the validation split has been 'seen' at training time*
- *this may still result in unseen fandoms at validation time, e.g. (A1, F1, A1, F2) at training time and (A1, F3, A1, F4) at validation time
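For illustration, here is a minimal sketch of that SA assignment, assuming the SA pairs have already been grouped by author into a dict; the function name, signature, and grouping are ours, not the actual code in preprocess/split_train_val.py:

```python
import random

def split_sa_pairs_v1_sketch(sa_by_author, target_val_size):
    """Toy v1 SA assignment: visit authors in random order and, while the
    validation split still needs examples, split each visited author's SA
    pairs evenly between train and val, so every SA author seen at
    validation time has also been seen at training time."""
    train, val = [], []
    authors = list(sa_by_author)
    random.shuffle(authors)
    for author in authors:
        pairs = sa_by_author[author]
        if len(val) < target_val_size and len(pairs) >= 2:
            half = len(pairs) // 2
            val.extend(pairs[:half])    # e.g. 5 of 10 pairs go to val
            train.extend(pairs[half:])  # ... and the other 5 stay in train
        else:
            train.extend(pairs)
    return train, val
```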
If we separate the DA pairs (Ai, Aj) into two groups, Train and Test, such that both authors of each DA pair in Test also appear in DA pairs in Train (or in SA pairs), we get the following stats:
- Number of DA pairs: 127787
- Number of candidate test pairs: 181
- Number of candidate train pairs: 127606
The small number of candidate test pairs suggests that most of the authors in the DA pairs of the test split are 'unseen' at training time. To loosen this restriction, we can split the DA pairs such that at least one of the authors (Ai, Aj) in a DA Test pair appears in other DA Train pairs or in SA pairs. We get the following stats:
- Number of DA pairs: 127787
- Number of candidate test pairs: 17894
- Number of candidate train pairs: 109893
We therefore split a PAN dataset into Train and Val/Test such that at least one of the authors in each DA test pair appears in DA train pairs or in SA train pairs. The SA pairs of an author A are equally distributed between Train and Test.
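The candidate selection above can be sketched as follows; the `authors` field and the helper name are illustrative assumptions, not the repository's exact code:

```python
from collections import Counter

def candidate_da_test_pairs_sketch(da_pairs, sa_authors):
    """Toy selection of DA pairs eligible for the test split: at least one
    of the two authors must also occur in another DA pair or in an SA
    pair, so that author remains 'seen' at training time."""
    counts = Counter(a for pair in da_pairs for a in pair["authors"])
    candidates, train_only = [], []
    for pair in da_pairs:
        a1, a2 = pair["authors"]
        seen_elsewhere = (
            counts[a1] > 1 or counts[a2] > 1         # occurs in another DA pair
            or a1 in sa_authors or a2 in sa_authors  # occurs in an SA pair
        )
        (candidates if seen_elsewhere else train_only).append(pair)
    return candidates, train_only
```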
Here are some dataset statistics:
dataset | filename | size | SA / SA-SF / SA-DF | DA / DA-SF / DA-DF |
---|---|---|---|---|
PAN 2020 large - original | pan20-av-large.jsonl | 275565 | - | - |
PAN 2020 large - test | pan20-av-large-test.jsonl | 13785 | 7396/0/7396 | 6389/355/6034 |
PAN 2020 large - w/o test | pan20-av-large-notest.jsonl | 261784 | - | - |
PAN 2020 large - train | pan20-av-large-train.jsonl | 248688 | 133359/0/133359 | 115329/22420/92909 |
PAN 2020 large - val | pan20-av-large-val.jsonl | 13090 | 7023/0/7023 | 6069/356/5713 |
To split the PAN 2020 large dataset (pan20-av-large-notest.jsonl) into train and validation splits using the v2 version, call the split_jsonl_dataset function in preprocess/split_train_val.py:
cd preprocess
python split_train_val.py
Make sure to specify the correct paths and split function:
# split Train dataset into Train and Val
split_jsonl_dataset(
    path_to_original_jsonl="pan20-av-large-notest.jsonl",
    path_to_train_jsonl="pan20-av-large-train.jsonl",
    path_to_test_jsonl="pan20-av-large-val.jsonl",
    split_function=split_pan_dataset_closed_set_v2,
    test_split_percentage=0.05
)
In the open-set setup, authors and fandoms in the test set do not appear in the training set. This is difficult to achieve simultaneously, so we create separate splits: unseen authors, unseen fandoms, and unseen all.
In this split, authors in the test set do not appear in the training set. However, this is difficult to achieve exactly for the PAN 2020 dataset, so we split it into train and val/test sets such that:
- authors of same-author (SA) pairs in the test set do not appear in SA training pairs (sketched after this list)
- some authors (<5%) of different-author (DA) pairs in the test set may appear in the DA training pairs
- most of the fandoms in the test set appear in the training set
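A minimal sketch of the SA side of this split follows; holding out whole authors guarantees that no SA test author occurs in an SA training pair. The grouping and names are illustrative assumptions:

```python
import random

def split_sa_pairs_unseen_authors_sketch(sa_by_author, target_test_size):
    """Toy unseen-authors SA split: an author's SA pairs go either all to
    train or all to test, so SA test authors are never seen in training."""
    train, test = [], []
    authors = list(sa_by_author)
    random.shuffle(authors)
    for author in authors:
        if len(test) < target_test_size:
            test.extend(sa_by_author[author])   # author held out entirely
        else:
            train.extend(sa_by_author[author])
    return train, test
```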
Download pan2020_open_set_unseen_authors_splits.zip and unzip it. This is the structure of its content:
unseen_authors/
xl/
pan20-av-large-test.jsonl
pan20-av-large-notest.jsonl
xs/
pan20-av-small-test.jsonl
pan20-av-small-notest.jsonl
Here are some dataset statistics:
dataset | filename | size | SA / SA-SF / SA-DF | DA / DA-SF / DA-DF |
---|---|---|---|---|
PAN 2020 large - original | pan20-av-large.jsonl | 275565 | - | - |
PAN 2020 large - test | pan20-av-large-test.jsonl | 13777 | 7388/0/7388 | 6389/2061/4328 |
PAN 2020 large - w/o test | pan20-av-large-notest.jsonl | 261788 | 140390/0/140390 | 121398/21070/100328 |
PAN 2020 large - train | pan20-av-large-train.jsonl | 248699 | 133367/0/133367 | 115332/18840/96492 |
PAN 2020 large - val | pan20-av-large-val.jsonl | 13089 | 7023/0/7023 | 6066/2230/3836 |
To split the PAN 2020 large dataset (pan20-av-large-notest.jsonl) into train and validation splits, call the split_jsonl_dataset function in preprocess/split_train_val.py:
cd preprocess
python split_train_val.py
Make sure to specify the correct paths and split function:
# split Train dataset into Train and Val
split_jsonl_dataset(
    path_to_original_jsonl="pan20-av-large-notest.jsonl",
    path_to_train_jsonl="pan20-av-large-train.jsonl",
    path_to_test_jsonl="pan20-av-large-val.jsonl",
    split_function=split_pan_dataset_open_set_unseen_authors,
    test_split_percentage=0.05
)
In this split type:
- examples at val/test time belong to fandoms that have not been seen during training
- some authors in the val/test set may also appear in the train set
- training examples (d1, d2, f1, f2) where either f1 or f2 appears among the test fandoms are dropped (sketched after this list) => this results in ~110K fewer training examples
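The fandom filter can be sketched as below; the `fandoms` field follows the merged PAN 2020 format, and the function name is ours:

```python
def drop_test_fandoms_sketch(train_examples, test_fandoms):
    """Toy fandom filter: drop any training example whose document pair
    touches a held-out test fandom; this is the step that removes ~110K
    training examples."""
    held_out = set(test_fandoms)
    return [
        ex for ex in train_examples
        if not (set(ex["fandoms"]) & held_out)  # assumed 'fandoms' field
    ]
```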
All train/val/test splits are provided.
Download pan2020_open_set_unseen_fandoms_splits.zip and unzip it. This is the structure of its content:
unseen_fandoms/
xl/
pan20-av-large-train.jsonl
pan20-av-large-val.jsonl
pan20-av-large-test.jsonl
xs/
pan20-av-small-train.jsonl
pan20-av-small-val.jsonl
pan20-av-small-test.jsonl
Here are some XL dataset statistics:
dataset | filename | size | SA / SA-SF / SA-DF | DA / DA-SF / DA-DF |
---|---|---|---|---|
PAN 2020 XL - original | pan20-av-large.jsonl | 275565 | 147778/0/147778 | 127787/23131/104656 |
PAN 2020 XL - train | pan20-av-large-train.jsonl | 133990 | 71826/0/71826 | 62164/20779/41385 |
PAN 2020 XL - val | pan20-av-large-val.jsonl | 13451 | 7047/0/7047 | 6408/1176/5232 |
PAN 2020 XL - test | pan20-av-large-test.jsonl | 13453 | 7056/0/7056 | 6409/1176/5233 |
Here are some XS dataset statistics:
dataset | filename | size | SA / SA-SF / SA-DF | DA / DA-SF / DA-DF |
---|---|---|---|---|
PAN 2020 XS - original | pan20-av-small.jsonl | 52601 | 27834/0/27834 | 24767/0/24767 |
PAN 2020 XS - train | pan20-av-small-train.jsonl | 36859 | 22547/0/22547 | 14312/0/14312 |
PAN 2020 XS - val | pan20-av-small-val.jsonl | 4179 | 2568/0/2568 | 1393/0/1393 |
PAN 2020 XS - test | pan20-av-small-test.jsonl | 4180 | 2719/0/2719 | 1394/0/1394 |
To split the PAN 2020 original dataset (pan20-av-*.jsonl) into train/validation/test splits, call the split_jsonl_dataset_into_train_val_test function in preprocess/split_train_val.py:
cd preprocess
python split_train_val.py
For the XS dataset:
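# paths_dict is assumed to be defined in split_train_val.py, mapping split
# names ('original', 'train', 'val', 'test') to the corresponding file paths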
split_jsonl_dataset_into_train_val_test(
path_to_original_jsonl=paths_dict['original'],
path_to_train_jsonl=paths_dict['train'],
path_to_val_jsonl=paths_dict['val'],
path_to_test_jsonl=paths_dict['test'],
split_function=split_pan_small_dataset_open_set_unseen_fandoms,
test_split_percentage=0.2
)
For the XL dataset:
split_jsonl_dataset_into_train_val_test(
path_to_original_jsonl=paths_dict['original'],
path_to_train_jsonl=paths_dict['train'],
path_to_val_jsonl=paths_dict['val'],
path_to_test_jsonl=paths_dict['test'],
split_function=split_pan_dataset_open_set_unseen_fandoms,
test_split_percentage=0.1
)
In this split type:
- authors and fandoms in the test set have not been seen in the training data
- authors in the validation set have not been seen in the training set, but validation fandoms are similar to the training fandoms
All train/val/test splits are provided.
Download pan2020_open_set_unseen_all_splits.zip and unzip it. This is the structure of its content:
unseen_all/
xl/
pan20-av-large-train.jsonl
pan20-av-large-val.jsonl
pan20-av-large-test.jsonl
xs/
pan20-av-small-train.jsonl
pan20-av-small-val.jsonl
pan20-av-small-test.jsonl
Here are some XL dataset statistics:
dataset | filename | size | SA / SA-SF / SA-DF | DA / DA-SF / DA-DF |
---|---|---|---|---|
PAN 2020 XL - original | pan20-av-large.jsonl | 275565 | 147778/0/147778 | 127787/23131/104656 |
PAN 2020 XL - train | pan20-av-large-train.jsonl | 248001 | 124000/0/124000 | 124001/62286/61715 |
PAN 2020 XL - val | pan20-av-large-val.jsonl | 13703 | 6852/0/6852 | 6851/2966/3885 |
PAN 2020 XL - test | pan20-av-large-test.jsonl | 13704 | 6853/0/6853 | 6851/1633/5218 |
Here are some XS dataset statistics:
dataset | filename | size | SA / SA-SF / SA-DF | DA / DA-SF / DA-DF |
---|---|---|---|---|
PAN 2020 XS - original | pan20-av-small.jsonl | 52601 | 27834/0/27834 | 24767/0/24767 |
PAN 2020 XS - train | pan20-av-small-train.jsonl | 36851 | 18425/0/18425 | 18426/31/18395 |
PAN 2020 XS - val | pan20-av-small-val.jsonl | 4003 | 2002/0/2002 | 2001/2/1999 |
PAN 2020 XS - test | pan20-av-small-test.jsonl | 4001 | 2000/0/2000 | 2001/3/1998 |
To split the PAN 2020 original dataset (pan20-av-*.jsonl) into train/validation/test splits, call the split_jsonl_dataset_resampling function in preprocess/split_train_val.py:
cd preprocess
python split_train_val.py
For the XL dataset:
train_examples, val_examples, test_examples = split_jsonl_dataset_resampling(
path_to_original_jsonl=None,
path_to_authors_json=paths_dict['authors_dict'],
path_to_train_jsonl=paths_dict['train'],
path_to_val_jsonl=paths_dict['val'],
path_to_test_jsonl=paths_dict['test'],
train_size=248000,
test_size=13700
)
The PAN 2020 large dataset has 275,565 examples, detailed here:
| | same fandom | cross-fandom |
---|---|---|
same-author pairs | 0 | 147,778 |
different-author pairs | 23,131 | 104,656 |
- same-author pairs are constructed from 41,370 authors, while different-author pairs are constructed from 251,503 authors
- 14,704 authors in SA pairs can be found in DA pairs as well
- 3,966 authors appear in more than one DA pair
- author tuples (Ai, Aj) in DA pairs are unique (e.g. authors 532 and 7145 can be found in this combination only once among the DA pairs)
- there are 494,236 distinct documents
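The per-category counts reported in the tables above can be recomputed from a merged pan20-av-*.jsonl file with a short script along these lines, assuming each line carries the 'same' and 'fandoms' fields produced by the merge step described below:

```python
import json
from collections import Counter

def count_pair_types(path_to_jsonl):
    """Recompute SA/SA-SF/SA-DF and DA/DA-SF/DA-DF counts from a merged
    pan20-av-*.jsonl file (one json object per line)."""
    stats = Counter()
    with open(path_to_jsonl) as f:
        for line in f:
            ex = json.loads(line)
            prefix = "SA" if ex["same"] else "DA"  # assumed boolean label
            f1, f2 = ex["fandoms"]                 # assumed fandom pair
            stats[prefix] += 1
            stats[prefix + ("-SF" if f1 == f2 else "-DF")] += 1
    return stats
```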
We now detail the closed-set and open-set setups. In both setups, we split the XL dataset into 95% training and 5% test and the XS dataset into 90% training and 10% test.
The PAN 2020 small dataset has 52,601 examples, detailed here:
| | same fandom | cross-fandom |
---|---|---|
same-author pairs | 0 | 27,834 |
different-author pairs | 0 | 24,767 |
dataset | original examples file | original ground truth file | merged file |
---|---|---|---|
PAN 2020 XS | pan20-authorship-verification-training-small.jsonl | pan20-authorship-verification-training-small-truth.jsonl | pan20-av-small.jsonl |
PAN 2020 XL | pan20-authorship-verification-training-large.jsonl | pan20-authorship-verification-training-large-truth.jsonl | pan20-av-large.jsonl |
We concatenate the original data and ground truth files into a single file pan20-av-*.jsonl by calling the merge_data_and_labels() function.
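A minimal sketch of such a merge, assuming the PAN 2020 format in which the data file carries 'id', 'fandoms', and 'pair' fields and the truth file carries 'id', 'same', and 'authors' fields; the actual merge_data_and_labels() implementation may differ:

```python
import json

def merge_data_and_labels_sketch(data_path, truth_path, out_path):
    """Join each example with its ground-truth record on 'id' and write
    one self-contained json object per line."""
    truth = {}
    with open(truth_path) as f:
        for line in f:
            rec = json.loads(line)
            truth[rec["id"]] = rec
    with open(data_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            ex = json.loads(line)
            ex.update(truth[ex["id"]])  # adds 'same' and 'authors'
            fout.write(json.dumps(ex) + "\n")
```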
Since the .jsonl files are quite large, we use the write_jsonl_to_folder() function to store the examples from pan20-av-*.jsonl as separate .json files inside a folder.
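A sketch of this per-example export, with the per-file naming scheme as an assumption:

```python
import json
import os

def write_jsonl_to_folder_sketch(path_to_jsonl, folder):
    """Store every line of a large .jsonl file as its own .json file so
    single examples can be read without loading the whole dataset."""
    os.makedirs(folder, exist_ok=True)
    with open(path_to_jsonl) as f:
        for i, line in enumerate(f):
            ex = json.loads(line)
            name = str(ex.get("id", i)) + ".json"  # assumed naming scheme
            with open(os.path.join(folder, name), "w") as out:
                json.dump(ex, out)
```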
dataset | train/val/test sizes |
---|---|
reddit closed set | 284/486/558 |
reddit open set (unseen authors) | 204/412/412 |
TODO