Source code for the papers:
- Tag v1.0: Investigating Text Simplification Evaluation, accepted at Findings of ACL-IJCNLP 2021
- Tag v2.0: The Role of Text Simplification Operations in Evaluation, accepted at the CTTS-2021 workshop
By @lmvasquezr, @MattShardlow, Piotr Przybyła and @SAnaniadou.
If you have any questions, please don't hesitate to contact us. Feel free to submit any issues or enhancement requests on GitHub as well.
- Analysis of Text Simplification corpora based on simplification operations, using the edit distance measure.
- Creation of better distributed datasets (random and with our heuristic for reduction of incorrect alignments)
- Technical details and modifications done for performance evaluation using EditNTS model.
You will need Python 3.7+ and Java (tested on 15.0.1)
git clone https://github.com/lmvasque/ts-explore.git
cd ts-explore
pip install -r requirements.txt
We have adapted the EditNTS model code to run in our setting. You can use this adaptation from our fork of the original repo. Our changes include:
- Code migration to Python 3
- Scripts for data preprocessing
- Other minor fixes
Create a JSON file with the locations of the dataset files:
{
"wikismall": {
"test": "<data_dir>/wikismall/PWKP_108016.tag.80.aner.ori.test",
"dev": "<data_dir>/wikismall/PWKP_108016.tag.80.aner.ori.valid",
"train": "<data_dir>/wikismall/PWKP_108016.tag.80.aner.ori.train",
"tag": ["src", "dst"]
}
}
This is an example of wikismall.json, describing subsets located in <data_dir>/wikismall/ whose filenames start with PWKP_108016.tag.80.aner.ori and end with .src or .dst.
Edit-distance calculations occur in Java. Open a new terminal and run the following command:
cd ts-explore/java
/bin/bash run.sh
In a new terminal, run from the downloaded git repo:
python ts_eval.py --analysis --datasets examples/wikismall.json --output_dir output
For creating random distributed datasets:
python ts_eval.py --create random --datasets examples/wikismall.json --seed 324 --output_dir output
For creating datasets with fewer poor alignments (incorrectly aligned sentence pairs):
python ts_eval.py --create unaligned --datasets examples/wikismall.json --sample 0.95 --seed 324 --output_dir output
We adapted the original EditNTS model and documented our changes here. Then, we trained our model as follows:
python main.py --vocab_path vocab_data/ --device 0 --data_path datasets/<dataset_dir>/<dataset_train_dev> --store_dir <output_dir> --batch_size 64 --lr 0.001 --vocab_size 30000 --run_training
To run model evaluation:
python main.py --vocab_path vocab_data/ --device 0 --data_path datasets/<dataset_dir>/<dataset_test> --store_dir output/ --load_model output/<model>/checkpoints/<checkpoints_dir> --batch_size 64 --lr 0.001 --vocab_size 30000 --run_eval
📝 Note: To use this model, you must first follow a preprocessing step. We used the setting with no duplicate sentences. Please refer to the original documentation for further details.
If you would like to use our edit-distance algorithm to obtain the simplification operations, you can run it as follows:
- In a separate terminal run the following command to start the Java Server:
git clone https://github.com/lmvasque/ts-explore.git
cd ts-explore/java
./run.sh
- Run the script to obtain the list of operations needed to transform the source sentence into the target sentence.
python count_operations.py --source "The house was painted last week by John ." --target "John painted the house last week ."
- Finally, you will get a list of operations, including the source and target token involved in the operation:
REPLACE,the,john
REPLACE,house,painted
REPLACE,was,the
REPLACE,painted,house
DELETE,by,null
DELETE,john,null
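The operation list above corresponds to a standard token-level Levenshtein alignment with backtracing. A minimal Python sketch of the idea (the repository's actual implementation runs in Java; the function name here is illustrative):

```python
def edit_operations(source, target):
    """Token-level Levenshtein alignment returning (op, src_token, tgt_token) triples."""
    src = [t.lower() for t in source.split()]
    tgt = [t.lower() for t in target.split()]
    n, m = len(src), len(tgt)
    # Dynamic-programming table of edit distances between prefixes
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # keep or replace
    # Backtrace from the bottom-right corner to recover the operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and src[i - 1] == tgt[j - 1]:
            i, j = i - 1, j - 1  # tokens match, no operation
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            ops.append(("REPLACE", src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("DELETE", src[i - 1], None))
            i -= 1
        else:
            ops.append(("INSERT", None, tgt[j - 1]))
            j -= 1
    return list(reversed(ops))
```

Note that several alignments can share the same minimal cost, so the exact operation sequence may differ from the Java server's output while the total number of operations stays the same.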
To replicate our results, please download or request the following resources:
- WikiLarge & WikiSmall: from (Zhang and Lapata, 2017) splits.
- Turk Corpus: from (Xu, 2016) splits.
- ASSET: from (Alva-Manchego, 2020) splits. In this dataset, we applied minor transformations for consistency with the other datasets, which have spaces around punctuation marks. This is the list of replacements applied:
regex = [(",", " ,"), (".", " . "), ("(", " ( "), (")", " ) ")]
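A minimal sketch of applying these replacements to a line (the in-repo preprocessing may differ; the function name is illustrative):

```python
def normalize_punct(line):
    """Insert spaces around punctuation so ASSET matches the other corpora's tokenization."""
    # Replacement pairs as listed above
    regex = [(",", " ,"), (".", " . "), ("(", " ( "), (")", " ) ")]
    for old, new in regex:
        line = line.replace(old, new)
    return " ".join(line.split())  # collapse any doubled spaces
```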
- WikiManual: from (Jiang, 2020) splits. We limited our analysis to sentences labeled as "aligned", filtering them as follows:
grep -E "^aligned" <file>
- MSD: from (Cao, 2020) splits. The original dataset comes in JSON format; we extracted the "text" field from each sentence, keeping every even line as the complex sentence and its corresponding odd line as the simple sentence.
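The even/odd pairing can be sketched as follows, assuming one JSON record per line with a "text" field (the exact record layout of MSD may differ):

```python
import json

def msd_pairs(json_lines):
    """Extract the 'text' field from each JSON line, then pair even-indexed
    lines (complex sentences) with the following odd-indexed lines (simple)."""
    texts = [json.loads(line)["text"] for line in json_lines]
    # 0-based indexing: even index = complex, next odd index = simple
    return list(zip(texts[0::2], texts[1::2]))
```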
We have created a sample configuration file to replicate our TS dataset analysis. Please update this file with the locations of your data files. You can then run the dataset analysis as follows:
python ts_eval.py --analysis --datasets examples/ts_datasets.json --output_dir output
You will see the following outputs:
- Edit-distance plots under <output_dir>/imgs
- KL divergences between the subsets of each dataset, reported in the console:
Distribution divergences between Test/Dev subsets
Dataset Value
wikimanual 0.102053
wikilarge 0.462257
wikismall 0.069603
Distribution divergences between Test/Train subsets
Dataset Value
wikimanual 0.017596
wikilarge 0.463852
wikismall 0.057977
📝 Note: For ASSET and TurkCorpus, the KL divergences were calculated differently since these datasets have multiple references. In our experiments, we merged all references into a single file for each subset (test, dev and train) and then calculated the divergences.
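The divergence computation can be sketched as below: bin the per-sentence edit-distance scores of two subsets into histograms and compute KL(P || Q). The bin count and smoothing constant here are illustrative assumptions, not the exact values used in the paper.

```python
import math

def kl_divergence(p_scores, q_scores, bins=20, lo=0.0, hi=100.0, eps=1e-9):
    """KL(P || Q) between two samples of edit-distance scores via histogram binning."""
    def hist(scores):
        counts = [0] * bins
        for s in scores:
            # Clamp into the last bin so the upper boundary value is not lost
            idx = min(int((s - lo) / (hi - lo) * bins), bins - 1)
            counts[idx] += 1
        total = sum(counts)
        # Additive smoothing so empty bins do not produce infinite divergence
        return [(c + eps) / (total + bins * eps) for c in counts]
    p, q = hist(p_scores), hist(q_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical score distributions give a divergence of 0, and the value grows as the two subsets' edit-distance profiles drift apart, matching the pattern in the table above.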
- Datasets files (complex and simple sentences in separate files) under <output_dir>/txt
- Text files with edit-distance calculations under <output_dir>/txt
# Edit distance calculations: Score, Complex, Simple (tab-separated)
4.3478260869565215 She performed for President Reagan in 1988's Great Performances at the White House series , which aired on the Public Broadcasting Service . She performed for Reagan in 1988's Great Performances at the White House series , which aired on the Public Broadcasting Service .
4.545454545454546 This was demonstrated in the Miller-Urey experiment by Stanley L . Miller and Harold C . Urey in 1953 . This was shown in the Miller-Urey experiment by Stanley L . Miller and Harold C . Urey in 1953 .
4.545454545454546 This was substantially complete when Messiaen died , and Yvonne Loriod undertook the final movement's orchestration with advice from George Benjamin . This was mostly complete when Messiaen died , and Yvonne Loriod undertook the final movement's orchestration with advice from George Benjamin .
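The scores in these files are consistent with edit operations per 100 source tokens (e.g. one deletion over a 23-token sentence gives 100/23 ≈ 4.3478). This is our reading of the output format; a sketch under that assumption:

```python
def edit_score(n_edits, source_sentence):
    """Edit operations per 100 source tokens (whitespace tokenization assumed)."""
    return 100.0 * n_edits / len(source_sentence.split())
```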
Use the following command lines to reproduce our datasets.
# Supported values (evaluated in our paper)
# sample: 0.98, 0.95, 0.90, 0.85 and 0.80
# seed: 155, 324, 393, 728, 989
# Wikilarge Random
python ts_eval.py --create random --seed 324 --datasets examples/datasets.wikilarge.json --output_dir output
# Wikilarge 98%
python ts_eval.py --create unaligned --datasets examples/datasets.wikilarge.json --sample 0.98 --seed 324 --output_dir output
# Wikilarge 95%
python ts_eval.py --create unaligned --datasets examples/datasets.wikilarge.json --sample 0.95 --seed 324 --output_dir output
And datasets.wikilarge.json will look like this:
{
"wikilarge": {
"test": "<data_dir>/wikilarge/wiki.full.aner.ori.test",
"dev": "<data_dir>/wikilarge/wiki.full.aner.ori.dev",
"train": "<data_dir>/wikilarge/wiki.full.aner.ori.train",
"tag": ["src", "dst"]
}
}
The same steps apply for the WikiSmall dataset; just update the .json file.
📝 Note: The scripts above will recreate the datasets from scratch. We recommend this method since the scripts fix minor limitations found in the data after publication. If you still want to use the original datasets, you can download them from here.
For the datasets analysis and creation, we ran under the following setting:
- Processor: 2 GHz Quad-Core Intel Core i5
- Memory: 16 GB
Analysis duration: ~5 minutes for all datasets presented in this paper.
For model training, we used a different setting: 1 GPU with the following specs:
- Tesla V100-SXM2-16GB
- CUDA Driver Version = 11.2
Model training duration: ~3-4 hours for WikiSmall and ~17-22 hours for WikiLarge experiments.
If you use our results and scripts in your research, please cite our work:
Investigating Text Simplification Evaluation: includes the evaluation of KL divergences of Wikipedia-based TS datasets and our random (single-seed) and poor-alignment (98% and 95%) analyses. These scenarios are evaluated together.
@inproceedings{vasquez-rodriguez-etal-2021-investigating,
title = "Investigating Text Simplification Evaluation",
author = "V{\'a}squez-Rodr{\'\i}guez, Laura and
Shardlow, Matthew and
Przyby{\l}a, Piotr and
Ananiadou, Sophia",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.77",
pages = "876--882",
}
The Role of Text Simplification Operations in Evaluation: extends our analysis with multiple seeds (5) for random splits, more poor-alignment scenarios (98%, 95%, 90%, 85%, 80%) and a Monte Carlo algorithm analysis. These scenarios are evaluated independently.
@inproceedings{vasquez-rodriguez-etal-2021-the-role,
title = "The Role of Text Simplification Operations in Evaluation",
author = "V{\'a}squez-Rodr{\'\i}guez, Laura and
Shardlow, Matthew and
Przyby{\l}a, Piotr and
Ananiadou, Sophia",
booktitle = "First Workshop on Current Trends in Text Simplification (CTTS 2021)",
month = sep,
year = "2021",
address = "Online",
publisher = "CEUR Workshop Proceedings (CEUR-WS.org)",
url = "https://ceur-ws.org/Vol-2944/paper4.pdf",
pages = "57--69",
}