## Zero-Shot Cross-Lingual Text Classification with Robust Training

Code for our EMNLP-2021 paper ["Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training"](https://arxiv.org/abs/2104.08645). Most of our code is based on [XTREME](https://github.com/google-research/xtreme).

The current codebase only supports `randomized smoothing with data augmentation`, the method that yields the most consistent improvements across tasks in our paper (a minimal conceptual sketch of the method appears at the end of this README). We may add other methods and baselines to this repository in the future.

### Setup

- Python 3.7+

```
bash install_tools.sh
```

If you encounter any issues when installing the environment, please refer to [XTREME](https://github.com/google-research/xtreme).

### Data

Download the [data](https://drive.google.com/file/d/184VriHkpfffWPSTcZf6wUm8ezz7RIUsX/view?usp=sharing) and unzip it. It includes the original PAWS-X and XNLI datasets as well as the test sets for the *generalized* setting.

Run the following commands to generate the augmented data for randomized smoothing:

```
python perturb.py --task pawsx --input_dir data_generalized --output_dir data_generalized_augment --num 10
python perturb.py --task xnli --input_dir data_generalized --output_dir data_generalized_augment --num 3
```

### Training

```
./scripts/train_pawsx.sh bert-base-multilingual-cased [gpu_id] data_generalized_augment [output_dir]
./scripts/train_xnli.sh bert-base-multilingual-cased [gpu_id] data_generalized_augment [output_dir]
```

### Evaluation

For the standard setting:

```
./scripts/eval_pawsx.sh bert-base-multilingual-cased [gpu_id] data_generalized_augment [output_dir] [model_dir]/checkpoint-best/
./scripts/eval_xnli.sh bert-base-multilingual-cased [gpu_id] data_generalized_augment [output_dir] [model_dir]/checkpoint-best/
```

For the generalized setting:

```
./scripts/eval_generalized_pawsx.sh bert-base-multilingual-cased [gpu_id] data_generalized_augment [output_dir] [model_dir]/checkpoint-best/
./scripts/eval_generalized_xnli.sh bert-base-multilingual-cased [gpu_id] data_generalized_augment [output_dir] [model_dir]/checkpoint-best/
```

### Citation

If you find the code useful in your research, please consider citing our paper and the XTREME paper.

```
@inproceedings{Huang2021robust-xlt,
  author    = {Kuan-Hao Huang and Wasi Uddin Ahmad and Nanyun Peng and Kai-Wei Chang},
  title     = {Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2021},
}
```

```
@inproceedings{Hu20xtreme,
  author    = {Junjie Hu and Sebastian Ruder and Aditya Siddhant and Graham Neubig and Orhan Firat and Melvin Johnson},
  title     = {{XTREME:} {A} Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning (ICML)},
  year      = {2020},
}
```
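
### Method Sketch: Randomized Smoothing

For readers unfamiliar with the technique, randomized smoothing classifies an input by taking a majority vote over randomly perturbed copies of it; training on augmented data (as produced by `perturb.py` above) makes the base classifier robust to such perturbations. Below is a minimal, hypothetical Python sketch of the inference-time vote. The `perturb` and `classify` helpers are illustrative placeholders, not this repository's actual API.

```python
# Hypothetical sketch of randomized-smoothing inference: the smoothed
# prediction is the majority vote over perturbed copies of the input.
from collections import Counter
from typing import Callable


def smoothed_predict(
    text: str,
    perturb: Callable[[str], str],   # placeholder: random perturbation, e.g. word substitution
    classify: Callable[[str], int],  # placeholder: base classifier returning a label id
    num_samples: int = 10,           # number of perturbed copies to vote over
) -> int:
    """Return the majority-vote label over `num_samples` perturbed inputs."""
    votes = Counter(classify(perturb(text)) for _ in range(num_samples))
    return votes.most_common(1)[0][0]
```

In this sketch, increasing `num_samples` trades inference cost for a more stable vote; the actual perturbation strategy and sample counts used in our experiments are those configured via `perturb.py` and the training scripts.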