This is an implementation of CopyNet, which extends encoder-decoder models so that generated output sequences can contain "out of vocabulary" tokens that are present in the input sequence.
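To make the copying idea concrete, here is a minimal sketch of one common way to let a decoder refer to out-of-vocabulary input tokens: each distinct unknown token in the input gets a temporary id just past the end of the fixed vocabulary. This only illustrates the general idea and is not necessarily how this repository does it. The function names, the toy vocabulary, and the example sentence are all assumptions made up for this example.

```python
def build_extended_ids(input_tokens, vocab):
    """Map input tokens to ids; each distinct OOV token gets a temporary
    id starting at len(vocab). (Hypothetical helper, for illustration.)"""
    oov_ids = {}
    ids = []
    for tok in input_tokens:
        if tok in vocab:
            ids.append(vocab[tok])
        else:
            if tok not in oov_ids:
                oov_ids[tok] = len(vocab) + len(oov_ids)
            ids.append(oov_ids[tok])
    return ids, oov_ids


def decode(ids, vocab, oov_ids):
    """Turn ids back into tokens, resolving temporary OOV ids as well."""
    id_to_tok = {i: t for t, i in vocab.items()}
    id_to_tok.update({i: t for t, i in oov_ids.items()})
    return [id_to_tok[i] for i in ids]


# Toy vocabulary (assumption): "Zaphod" is not in it.
vocab = {"<UNK>": 0, "my": 1, "name": 2, "is": 3, "hello": 4}
source = "my name is Zaphod".split()

ids, oov_ids = build_extended_ids(source, vocab)
# A decoder that emits the temporary id can reproduce the unknown token.
copied = decode([vocab["hello"], oov_ids["Zaphod"]], vocab, oov_ids)
```

The temporary ids are valid only for the input sequence they were built from, so a real model would rebuild the mapping for every example.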
Dependencies: pytorch, numpy, tensorboardX (for logging), tqdm (for logging), spacy (for tokenization)
The model is trained on sequence pairs. Create a directory to hold the training files. Each file should contain 2 lines of text: the first is the input sequence, and the second is the target output sequence. The tokens in each sequence should be separated by spaces. I used spacy to tokenize the training data, so the SequencePairDataset class and the evaluation methods assume that spacy will be used. If you want to use a different tokenizer, be sure to update those files accordingly.
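The file format described above can be sketched as follows, using only the standard library. The helper names, the file name `pair_0.txt`, and the example tokens are hypothetical. The tokens are assumed to be already tokenized (for instance by spacy), so this sketch only joins them with spaces and reads them back.

```python
import os
import tempfile


def write_pair_file(path, input_tokens, target_tokens):
    """Write one training file: line 1 is the input, line 2 is the target."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(" ".join(input_tokens) + "\n")
        f.write(" ".join(target_tokens) + "\n")


def read_pair_file(path):
    """Read a training file back into two lists of tokens."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    if len(lines) != 2:
        raise ValueError(f"expected 2 lines in {path}, found {len(lines)}")
    return lines[0].split(" "), lines[1].split(" ")


# Example pair (assumption): the target copies a name from the input.
input_tokens = ["my", "name", "is", "Zaphod", "."]
target_tokens = ["hello", "Zaphod", "!"]

with tempfile.TemporaryDirectory() as train_dir:
    pair_path = os.path.join(train_dir, "pair_0.txt")  # hypothetical name
    write_pair_file(pair_path, input_tokens, target_tokens)
    src, tgt = read_pair_file(pair_path)
```

Because tokens are separated by single spaces, no token may itself contain a space, or the pair will not survive the round trip.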
Train the model using the training script. Most hyperparameters can be set with command-line arguments, which are documented in that script.