# Vocabulary
For text inputs, vocabulary files should be provided in the data configuration. A vocabulary file is a simple text file with one token per line. It should start with these 3 special tokens:

```text
<blank>
<s>
</s>
```
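For illustration, here is what the beginning of a complete vocabulary file could look like; the tokens following the 3 special entries are placeholders and depend entirely on your data:

```text
<blank>
<s>
</s>
the
of
and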
## Building vocabularies
The `onmt-build-vocab` script can be used to generate vocabulary files in multiple ways:
### Generate a vocabulary from tokenized training files
If your training data is already tokenized, you can build a vocabulary with the most frequent tokens. For example, the command below extracts the 50,000 most frequent tokens from the files `train.txt.tok` and `other.txt.tok` and saves them to `vocab.txt`:

```bash
onmt-build-vocab --save_vocab vocab.txt --size 50000 train.txt.tok other.txt.tok
```
Instead of defining a fixed size, you can also prune tokens that appear below a minimum frequency. See the `--min_frequency` option.
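For instance, a minimal sketch keeping only tokens that occur at least 5 times (the threshold value here is purely illustrative):

```bash
onmt-build-vocab --save_vocab vocab.txt --min_frequency 5 train.txt.tok other.txt.tok
```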
### Generate a vocabulary from raw training files with on-the-fly tokenization
By default, `onmt-build-vocab` splits each line on spaces. It is possible to define a custom tokenization with the `--tokenizer_config` option. See Tokenization for more information.
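As a sketch, assuming the tokenization options are saved in a file named `tokenization.yml` (a hypothetical filename; see the Tokenization documentation for the available options), the vocabulary can then be built directly from a raw training file:

```bash
onmt-build-vocab --tokenizer_config tokenization.yml --save_vocab vocab.txt --size 50000 train.txt.raw
```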
### Convert a SentencePiece vocabulary to OpenNMT-tf
If you trained a SentencePiece model, a vocabulary file was generated in the process. You can convert this vocabulary to work with OpenNMT-tf:

```bash
onmt-build-vocab --from_vocab sp.vocab --from_format sentencepiece --save_vocab vocab.txt
```
### Train a SentencePiece model and vocabulary with OpenNMT-tf
The `onmt-build-vocab` script can also train a new SentencePiece model and vocabulary from raw data. For example, the command:

```bash
onmt-build-vocab --sentencepiece --size 32000 --save_vocab sp train.txt.raw
```

will produce the SentencePiece model `sp.model` and the vocabulary `sp.vocab` of size 32,000. The vocabulary file is saved in the OpenNMT-tf format and can be directly used for training.
Additional SentencePiece training options can be passed to the `--sentencepiece` argument in the format `option=value`, e.g.:

```bash
onmt-build-vocab --sentencepiece character_coverage=0.98 num_threads=4 [...]
```
## Configuring vocabularies
In most cases, you should configure vocabularies with `source_vocabulary` and `target_vocabulary` in the `data` block of the YAML configuration, for example:

```yaml
data:
  source_vocabulary: src_vocab.txt
  target_vocabulary: tgt_vocab.txt
```
However, some models may require a different configuration:
Language models require a single vocabulary:

```yaml
data:
  vocabulary: vocab.txt
```
Parallel inputs require indexed vocabularies:

```yaml
data:
  source_1_vocabulary: src_1_vocab.txt  # Vocabulary of the 1st source input.
  source_2_vocabulary: src_2_vocab.txt  # Vocabulary of the 2nd source input.
```
Nested parallel inputs require an additional level of indexing:

```yaml
data:
  source_1_1_vocabulary: src_1_1_vocab.txt
  source_1_2_vocabulary: src_1_2_vocab.txt
  source_2_vocabulary: src_2_vocab.txt
```
Note: If you train a model with shared embeddings, you should still configure all vocabulary parameters, but in this case they should simply point to the same vocabulary file.
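A minimal sketch of such a configuration, assuming a common vocabulary file named `shared_vocab.txt` (an illustrative name):

```yaml
data:
  source_vocabulary: shared_vocab.txt
  target_vocabulary: shared_vocab.txt
```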