Datasets and code for the paper "TGEA 2.0: A Large-Scale Diagnostically Annotated Dataset with Benchmark Tasks for Text Generation of Pretrained Language Models".
Licensed under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0). (License URL: https://creativecommons.org/licenses/by-sa/4.0/)
Converting raw data to the format of each task
unzip data.zip
python data/convert_raw_data_to_benchmarks.py
python data/convert_gec_format.py
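The three preparation steps above can also be chained from a single Python entry point so that a failure in one step stops the rest. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `prepare_data` are hypothetical, and it assumes it is run from the repository root, where `data.zip` and the `data/` scripts named above live.

```python
import subprocess
import sys
import zipfile
from pathlib import Path

# Script paths are copied from the commands above.
CONVERSION_SCRIPTS = [
    "data/convert_raw_data_to_benchmarks.py",
    "data/convert_gec_format.py",
]


def build_commands(python=sys.executable):
    """Return the conversion commands in the order they should run."""
    return [[python, script] for script in CONVERSION_SCRIPTS]


def prepare_data(archive="data.zip", dry_run=False):
    """Unpack the archive, then run each conversion script in turn.

    With dry_run=True nothing is unpacked or executed; the planned
    commands are only returned.
    """
    commands = build_commands()
    if dry_run:
        return commands

    archive_path = Path(archive)
    if not archive_path.is_file():
        raise FileNotFoundError(f"archive not found: {archive_path}")

    # Equivalent to `unzip data.zip` in the current directory.
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(".")

    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so the second script never runs on a failed first conversion.
        subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical flag: pass --run to actually execute the steps;
    # by default the planned commands are only printed.
    for cmd in prepare_data(dry_run="--run" not in sys.argv):
        print(" ".join(cmd))
```

Using `sys.executable` keeps the conversion scripts on the same interpreter (and virtual environment) as the wrapper, which a bare `python` on the PATH does not guarantee.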
- Erroneous Text Detection
sh Diagnosis_tasks/train_b1.sh
- MiSEW Extraction
sh Diagnosis_tasks/train_b2.sh
- Erroneous Span Location
sh Diagnosis_tasks/train_b3.sh
- Error Type Classification
sh Diagnosis_tasks/train_b4.sh
- Error Correction
sh Diagnosis_tasks/train_b5.sh
m2scorer is used to evaluate the error correction results.
- Generation Pathology Mitigation
sh Generation_Pathology_Mitigation/train_b6.sh
python Generation_Pathology_Mitigation/evaluate.py
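
The task-to-script mapping listed above can be captured in a small lookup so that a benchmark is launched by name rather than by remembering which `train_bN.sh` belongs to it. This is a sketch under assumptions: the names `TRAIN_SCRIPTS`, `train_command`, and `train` are hypothetical and do not exist in the repository, and the script paths are taken verbatim from the commands above.

```python
import subprocess

# Task name -> training script, copied from the list above.
TRAIN_SCRIPTS = {
    "Erroneous Text Detection": "Diagnosis_tasks/train_b1.sh",
    "MiSEW Extraction": "Diagnosis_tasks/train_b2.sh",
    "Erroneous Span Location": "Diagnosis_tasks/train_b3.sh",
    "Error Type Classification": "Diagnosis_tasks/train_b4.sh",
    "Error Correction": "Diagnosis_tasks/train_b5.sh",
    "Generation Pathology Mitigation": "Generation_Pathology_Mitigation/train_b6.sh",
}


def train_command(task):
    """Return the shell command that trains the given benchmark task."""
    try:
        script = TRAIN_SCRIPTS[task]
    except KeyError:
        known = ", ".join(TRAIN_SCRIPTS)
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
    return ["sh", script]


def train(task, dry_run=True):
    """Build the command for a task and, unless dry_run is set, execute it."""
    cmd = train_command(task)
    if not dry_run:
        # Run from the repository root so the relative script path resolves.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Print the planned command for every task without running anything.
    for name in TRAIN_SCRIPTS:
        print(f"{name}: {' '.join(train(name))}")
```

Calling `train("Error Correction", dry_run=False)` from the repository root would run `sh Diagnosis_tasks/train_b5.sh`; the default `dry_run=True` only returns the command so the mapping can be checked first.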