
ConfNet2Seq: Full Length Answer Generation from Spoken Questions

Based on OpenNMT-py: Open-Source Neural Machine Translation

Code base for the paper ConfNet2Seq: Full Length Answer Generation from Spoken Questions. The dataset is contained in the data directory. train.ques, train.ans and train.tgt contain the data triplets (question, factoid answer, target full-length answer), one item per line in each file.


The codebase is built on top of OpenNMT.

Requirements

All dependencies can be installed via:

pip install -r requirements.txt

Data

The data files are present in data directory. The confusion network and audio files can be downloaded from https://drive.google.com/drive/folders/1nFtsOrSdE5v6Bjsw-B90MBYWvVtFBsYr?usp=sharing

Step 0: Add padding

Pad the answer text file.

python add_padding.py --input <answer-filename> --output <padded-answer-filename> --padding <number-of-pad-tokens>
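
For intuition, here is a minimal sketch of what a padding step like this could look like, assuming right-padding with OpenNMT-py's <blank> pad token. The actual token and placement are defined by add_padding.py itself, so treat this as illustrative only:

# Hypothetical sketch of the padding step, NOT the actual add_padding.py:
# appends num_pad pad tokens to each whitespace-tokenized answer line.
import argparse

def pad_file(input_path, output_path, num_pad, pad_token="<blank>"):
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            tokens = line.strip().split()
            tokens += [pad_token] * num_pad  # assumption: right-padding
            fout.write(" ".join(tokens) + "\n")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--padding", type=int, required=True)
    args = parser.parse_args()
    pad_file(args.input, args.output, args.padding)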

Step 1: Preprocess the data

python preprocess.py -train_confnet data/ques-train.txt -train_src data/ans-train.txt -train_tgt data/tgt-train.txt -valid_confnet data/ques-val.txt -valid_src data/ans-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo --dynamic_dict --share_vocab

The data consists of parallel source (src) and target (tgt) files, each containing one sentence per line with tokens separated by spaces:

  • ques-train.txt : text file containing the path to a question confusion network in each line
  • ans-train.txt : text file containing a padded factoid answer in each line
  • tgt-train.txt : text file containing a target full-length answer in each line
  • ques-val.txt : text file containing the path to a validation question confusion network in each line
  • ans-val.txt : text file containing a padded validation factoid answer in each line
  • tgt-val.txt : text file containing a validation target full-length answer in each line

Validation files are required and are used to evaluate the convergence of training. They usually contain no more than 5,000 sentences.
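
For illustration, aligned lines of the three training files might look like the following. The content is hypothetical: the confusion-network path depends on where the downloaded archive is extracted, and the pad token is whatever add_padding.py emits.

ques-train.txt : confnets/train/q_000001.cn
ans-train.txt  : barack obama <blank> <blank> <blank>
tgt-train.txt  : barack obama was the 44th president of the united states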

After running the preprocessing, the following files are generated:

  • demo.train.pt: serialized PyTorch file containing training data
  • demo.valid.pt: serialized PyTorch file containing validation data
  • demo.vocab.pt: serialized PyTorch file containing vocabulary data

Internally, the system never operates on the words themselves but on the vocabulary indices built during preprocessing.
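
To sanity-check the preprocessing output, the serialized files can be loaded directly with PyTorch. A minimal sketch, with the caveat that the exact container layout varies across OpenNMT-py versions (older releases store a list of (name, field) pairs, newer ones a dict of fields):

# Sketch: inspect the serialized vocabulary produced by preprocess.py.
import torch

vocab = torch.load("data/demo.vocab.pt")
print(type(vocab))  # inspect the container first; layout is version-dependent
print(vocab)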

Step 2: Train the model

python train.py -data data/demo -save_model demo-model -word_vec_size 300 -model_type lattice -encoder_type brnn -layers 2 -rnn_size 512 \
-batch_size 32 -valid_batch_size 32 -valid_steps 2500 -dropout 0.5 -start_decay_steps 10000 -coverage_attn -copy_attn \
--share_embeddings
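
Training periodically saves checkpoints named after -save_model (in OpenNMT-py these are typically <save_model>_step_<N>.pt). A hedged sketch for peeking inside one; the filename below is hypothetical, and the key names follow common OpenNMT-py conventions but may differ between versions:

# Sketch: inspect a training checkpoint. Adjust the filename to whatever
# train.py actually wrote (demo-model_step_10000.pt is assumed here).
import torch

ckpt = torch.load("demo-model_step_10000.pt", map_location="cpu")
print(list(ckpt.keys()))  # commonly includes 'model', 'optim', 'opt', 'vocab'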

Step 3: Translate

python translate.py -model <model_path> --data_type lattice -src data/ans-test.txt -confnet data/ques-test.txt -tgt data/tgt-test.txt -share_vocab -beam_size 10 -replace_unk -output pred.txt --batch_size 10
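
After translation, pred.txt contains one generated full-length answer per line, aligned with the references in tgt-test.txt. A small stdlib-only sketch for a quick qualitative check of the first few predictions against their references:

# Print the first few predictions next to their references.
# File names match the translate command above.
with open("pred.txt") as preds, open("data/tgt-test.txt") as refs:
    for i, (pred, ref) in enumerate(zip(preds, refs)):
        if i >= 5:
            break
        print("REF :", ref.strip())
        print("PRED:", pred.strip())
        print()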

To list all run options:

python preprocess.py --help
python train.py --help
python translate.py --help

Acknowledgements

OpenNMT-py is run as a collaborative open-source project. The original OpenNMT-py code was written by Adam Lerer (NYC) to reproduce OpenNMT-Lua using PyTorch.

Citation

@inproceedings{pal2020confnet2seq,
    title={ConfNet2Seq: Full Length Answer Generation from Spoken Questions},
    author={Vaishali Pal and Manish Shrivastava and Laurent Besacier},
    booktitle={Proceedings of the 23rd International Conference on Text, Speech and Dialogue (TSD 2020)},
    year={2020},
    publisher = {Springer}
}