SPACE3.0

This repository contains code and data for the SIGIR 2022 paper "SPACE-3: Unified Dialog Model Pre-training for Task-Oriented Dialog Understanding and Generation".

The full version with the appendix is available here: [PDF]

Abstract

Recently, pre-training methods have shown remarkable success in task-oriented dialog (TOD) systems. However, most existing pre-trained models for TOD focus on either dialog understanding or dialog generation, but not both. In this paper, we propose FORTUNE, a novel unified pre-trained dialog model learning from large-scale dialog corpora with limited annotations, which can be effectively fine-tuned on a wide range of downstream dialog tasks. Specifically, FORTUNE consists of four successive components in a single transformer to maintain a task-flow in TOD systems: (i) a dialog encoding module to encode dialog history, (ii) a dialog understanding module to extract semantic vectors from either user queries or system responses, (iii) a dialog policy module to generate a policy vector that contains high-level semantics of the response, and (iv) a dialog generation module to produce appropriate responses. We design a dedicated pre-training objective for each component. Concretely, we pre-train the dialog encoding module with span mask language modeling to learn contextualized dialog information. To capture the structured dialog semantics, we pre-train the dialog understanding module via a novel tree-induced semi-supervised contrastive learning objective with the help of extra dialog annotations. In addition, we pre-train the dialog policy module by minimizing the L2 distance between its output policy vector and the semantic vector of the response for policy optimization. Finally, the dialog generation model is pre-trained by language modeling. Results show that FORTUNE achieves state-of-the-art performance on eight downstream dialog benchmarks, including intent prediction, dialog state tracking, and end-to-end dialog modeling. We also show that FORTUNE has a stronger few-shot ability than existing models under the low-resource setting.

Main Results

SPACE performs end-to-end dialog modeling, dialog state tracking, and intent prediction, achieving new state-of-the-art results on all eight benchmark datasets: BANKING77, CLINC150, HWU64, CamRest, In-Car Assistant, MultiWOZ2.0, MultiWOZ2.1 and MultiWOZ2.2.

Intent Prediction      BANKING77   CLINC150   HWU64
Accuracy               94.94       97.89      94.14

Dialog State Tracking  Joint Goal Accuracy
MultiWOZ2.2            57.50

End-to-End Modeling    Inform   Success   BLEU    Combined Score
MultiWOZ2.0            95.30    88.00     19.30   110.95
MultiWOZ2.1            95.60    86.10     19.91   110.76

End-to-End Modeling    Match    SuccF1    BLEU    Combined Score
In-Car Assistant       85.26    83.16     22.92   107.13
CamRest                97.74    88.24     23.68   116.67

Requirements

We use the tokenization tool in spaCy. You can install the required Python packages with the following commands:

pip install "modelscope[nlp]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
python -m spacy download en_core_web_sm

Then download the NLTK resources punkt and wordnet with the following code:

import nltk
nltk.download('punkt')
nltk.download('wordnet')

Preparation

Path Definition

Define your own paths <YOUR_PROJECT_PATH> and <YOUR_SAVE_PATH> in scripts as follows:

PROJECT_NAME="SPACE"  # project name (fixed)
PROJECT_ROOT=<YOUR_PROJECT_PATH>/${PROJECT_NAME}  # root path of this project
SAVE_ROOT=<YOUR_SAVE_PATH>/${PROJECT_NAME}  # root path of model's output

Download Dataset and Model from ModelScope

Download the dataset and model files with the following code:

from modelscope.hub.snapshot_download import snapshot_download
model_id = 'damo/nlp_space_pretrained-dialog-model'
model_dir = snapshot_download(model_id, cache_dir="./modelscope", revision='v1.0.4')

The parameter cache_dir specifies the directory where the files are downloaded.

The dataset and model archives (data.tar.gz and model.tar.gz) are placed in the directory ./modelscope/damo/nlp_space_pretrained-dialog-model.

Dataset Preparation

You need to extract the downloaded dataset archive data.tar.gz.

The extracted directory data contains the pre-training corpora and five extra task-oriented dialog (TOD) benchmark datasets, which have already been processed.
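
For reference, a minimal extraction sketch in Python, assuming the archive sits in the ModelScope cache directory used in the download step (adjust the path to your own setup):

import tarfile

# Hypothetical path: follows the cache_dir used in the snapshot_download step above.
archive_path = "./modelscope/damo/nlp_space_pretrained-dialog-model/data.tar.gz"

# Extract into the current directory; this is expected to produce the directory ./data.
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(path=".")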

Pre-training Corpora

The pre-training corpora (including the BANKING77, CLINC150 and HWU64 benchmarks) in the extracted directory data/pre_train contain AnPreDial and UnPreDial.

  • AnPreDial: a new labeled dialog dataset annotated with semantic trees, which contains 32 existing labeled TOD datasets with 3 million turns, ranging from single-turn QA to multi-turn dialogs.
  • UnPreDial: a large-scale unlabeled dialog dataset consisting of 19M utterances carefully processed from 21 online dialog corpora, ranging from online forums to conversational machine reading comprehension.

TOD Benchmark Datasets

The TOD benchmark datasets in the extracted directory data contain CamRest, In-Car Assistant, MultiWOZ2.0, MultiWOZ2.1 and MultiWOZ2.2.

You need to put the unzipped directory data/MultiWOZ2.2 into the directory SPACE/trippy/data, and put the other dataset directories into the project directory SPACE/data for subsequent training.
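
A minimal sketch of this step in Python; the benchmark directory names below are illustrative (use the actual names found in the extracted data directory), and ./SPACE stands in for <YOUR_PROJECT_PATH>/SPACE:

import shutil
from pathlib import Path

data_root = Path("./data")       # extracted dataset directory
project_root = Path("./SPACE")   # adjust to <YOUR_PROJECT_PATH>/SPACE

# MultiWOZ2.2 goes to the TripPy sub-project for dialog state tracking.
shutil.move(str(data_root / "MultiWOZ2.2"), str(project_root / "trippy" / "data" / "MultiWOZ2.2"))

# The remaining benchmarks go to SPACE/data (directory names are assumptions).
(project_root / "data").mkdir(parents=True, exist_ok=True)
for name in ["CamRest", "In-Car Assistant", "MultiWOZ2.0", "MultiWOZ2.1"]:
    shutil.move(str(data_root / name), str(project_root / "data" / name))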

Pre-trained Checkpoint

The pre-trained model SPACE is contained in the downloaded model archive model.tar.gz.

  • SPACE: an uncased model (12-layers, 768-hidden, 12-heads, 110M parameters)

You need to extract the downloaded model archive model.tar.gz, then put the unzipped directory model into the project directory SPACE for further fine-tuning.
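
As with the dataset, a minimal sketch of this step, assuming the archive sits in the ModelScope cache directory from the download step and the project root is ./SPACE (both paths are assumptions; adjust to your layout):

import shutil
import tarfile
from pathlib import Path

cache_dir = Path("./modelscope/damo/nlp_space_pretrained-dialog-model")
project_root = Path("./SPACE")  # adjust to <YOUR_PROJECT_PATH>/SPACE

# Extract model.tar.gz; this is expected to produce the directory ./model.
with tarfile.open(cache_dir / "model.tar.gz", "r:gz") as tar:
    tar.extractall(path=".")

# Move the extracted checkpoint directory into the project root.
shutil.move("model", str(project_root / "model"))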

Directory Structure