IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

This is the dataset repository of IndoToD, presented at SEALP 2023, co-located with AACL 2023, where our paper received the Best Paper award 🏆 [ACL Anthology].

The code is written in PyTorch. If you use the source code or datasets in this repository in your work, please cite the following paper:

@inproceedings{kautsar2023indotod,
  title={IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems},
  author={Kautsar, Muhammad and Nurdini, Rahmah and Cahyawijaya, Samuel and Winata, Genta and Purwarianti, Ayu},
  booktitle={Proceedings of the First Workshop in South East Asian Language Processing},
  pages={85--99},
  year={2023}
}

Summary

We introduce IndoToD, a high-quality bilingual, multi-domain task-oriented dialogue (ToD) dataset for Indonesian and English. It comprises two datasets: IndoCamRest and IndoSMD.

Overall, the datasets cover four different domains. We delexicalize the dialogues to efficiently reduce the amount of annotation required. To ensure high-quality data collection, we hired native speakers to manually annotate the dialogues. The data are derived from two existing English ToD datasets, CamRest and SMD. Along with the original English datasets, these new Indonesian datasets serve as an effective benchmark for evaluating Indonesian and English ToD systems, as well as for exploring the potential benefits of cross-lingual and bilingual transfer learning approaches.
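To illustrate the idea of delexicalization mentioned above, here is a minimal Python sketch. It is not the pipeline used to build IndoToD: the function names `delexicalize` and `relexicalize`, the `[slot]` placeholder format, and the example sentences and entity values are all assumptions made for illustration. The sketch replaces entity values with slot placeholders so that a dialogue template only has to be annotated once, and can then be filled back in with different entity values.

```python
import re


def delexicalize(utterance, entities):
    """Replace each entity value in `utterance` with a `[slot]` placeholder.

    `entities` maps a slot name to its surface value. The placeholder
    format is a hypothetical choice, not the format used by IndoToD.
    """
    # Replace longer values first so a short value cannot clobber part of a longer one.
    for slot, value in sorted(entities.items(), key=lambda kv: -len(kv[1])):
        pattern = re.compile(r"\b" + re.escape(value) + r"\b", flags=re.IGNORECASE)
        placeholder = f"[{slot}]"
        # A function replacement avoids any escape processing of the placeholder.
        utterance = pattern.sub(lambda match: placeholder, utterance)
    return utterance


def relexicalize(template, entities):
    """Fill `[slot]` placeholders in `template` with concrete entity values."""
    for slot, value in entities.items():
        template = template.replace(f"[{slot}]", value)
    return template


# Hypothetical English utterance and entity values.
english = "Curry Garden is an expensive restaurant in the centre."
english_entities = {"name": "Curry Garden", "pricerange": "expensive", "area": "centre"}
template = delexicalize(english, english_entities)

# A hypothetical Indonesian version of the same template, filled with Indonesian values.
indonesian_template = "[name] adalah restoran [pricerange] di [area]."
indonesian_entities = {"name": "Curry Garden", "pricerange": "mahal", "area": "pusat kota"}
indonesian = relexicalize(indonesian_template, indonesian_entities)

print(template)
print(indonesian)
```

Under these assumptions, only the template needs manual work from an annotator, while entity values can be substituted automatically from a lookup table.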

IndoCamRest

IndoCamRest is a task-oriented dialogue dataset translated from the Cambridge Restaurant 676 (CamRest) dataset.

IndoSMD

IndoSMD is a task-oriented dialogue dataset translated from the In-Car Assistant (SMD) dataset.

Results

We set up a benchmark for both Indonesian and English ToD to evaluate the performance of current ToD systems in monolingual, cross-lingual, and bilingual settings.

Indonesian test set

English test set

License

The datasets are licensed under CC-BY-SA 4.0, and the code is licensed under Apache 2.0.
