This repo contains the source code of the paper: Replacing Language Model for Style Transfer. Our Replacing Language Model (RLM) is a novel sequence-to-sequence generation scheme, which gathers the flexibility of autoregressive models and the accuracy of non-autoregressive models.
- Autoregressive (AR) language models predict the next token based on the input sequence and the generated subsequence, which are flexible to rewrite sentences but difficult to control the content:
- Non-autoregressive (NAR) models (especially masked language models) each time generate serveral parts of sentences, which are accurate to preserve the sentences' backbone but challenged by the generation diversity:
- Our replacing language model (RLM) autoregressively replaces each token in the original sentence with a text span in the target style. In contrast, the new span is generated via a non-autoregressive masked language model:
- pytorch=1.10
- huggingface=4.11.3
- nltk
First, download the raw datasets Yelp
and Amazon
from this repo. The run the following command to convert data into JSON
format and get the masked train data for RLM:
RAW_DATA_DIR=/directory/to/downloaded/raw/data
DATA_DIR=/directory/to/converted/data
python preprocess/preprocess.py\
--raw_data_dir ${RAW_DATA_DIR}\
--data_dir ${DATA_DIR}\
--dataset "yelp" # or Amazon
After preprocessing, you check the generated data files in the ${DATA_DIR}/yelp/
directory:
markers.json
: the style related words learned from the training set.rare_word.json
: the low frequency words detected from the training set.masked_train_data.json
: the masked training data set for learning the RLM model.
To train the replacing language model, run the following commands:
DATA_DIR=/directory/to/converted/data/yelp
SAVE_DIR=/directory/to/save/model
python main.py\
--train_data_path ${DATA_DIR}/masked_train_data.json\
--markers_path ${DATA_DIR}/markers.json\
--rarewords_path ${DATA_DIR}/rare_words.json\
--model_path bert-base-uncased\
--tokenizer_path bert-base-uncased\
--save_dir ${SAVE_DIR}
With a trained RLM model, to evaluate it performance on testing set, run the following commands:
DATA_DIR=/directory/to/converted/data/yelp
TEST_DATA_PATH=test/test_data.json
MODEL_PATH=/directory/to/save/model
OUTPUT_DIR=/directory/to/save/transfer/results
python main.py\
--evaluate\
--test_data_path ${TEST_DATA_PATH}\
--markers_path ${DATA_DIR}/markers.json\
--rarewords_path ${DATA_DIR}/rare_words.json\
--model_path ${MODEL_PATH}\
--tokenizer_path bert-base-uncased\
--output_dir ${OUTPUT_DIR}