# Vocabulary
For text inputs, vocabulary files should be provided in the data configuration. A vocabulary file is a simple text file with one token per line. It should start with these 3 special tokens:

```text
<blank>
<s>
</s>
```
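For illustration, here is what the beginning of a complete vocabulary file could look like; the tokens following the 3 special entries are placeholders and depend entirely on your data:

```text
<blank>
<s>
</s>
the
of
and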
## Building vocabularies
The `onmt-build-vocab` script can be used to generate vocabulary files in multiple ways:
### Generate a vocabulary from tokenized training files
If your training data is already tokenized, you can build a vocabulary with the most frequent tokens. For example, the command below extracts the 50,000 most frequent tokens from the files `train.txt.tok` and `other.txt.tok` and saves them to `vocab.txt`:

```bash
onmt-build-vocab --save_vocab vocab.txt --size 50000 train.txt.tok other.txt.tok
```
Instead of defining a fixed size, you can also prune tokens that appear below a minimum frequency. See the `--min_frequency` option.
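For instance, a minimal sketch keeping only tokens that occur at least 5 times (the threshold value here is purely illustrative):

```bash
onmt-build-vocab --save_vocab vocab.txt --min_frequency 5 train.txt.tok other.txt.tok
```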
### Generate a vocabulary from raw training files with on-the-fly tokenization
By default, `onmt-build-vocab` splits each line on spaces. It is possible to define a custom tokenization with the `--tokenizer_config` option. See Tokenization for more information.
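As a sketch, assuming the tokenization options are saved in a file named `tokenization.yml` (a hypothetical filename; see the Tokenization documentation for the available options), the vocabulary can then be built directly from a raw training file:

```bash
onmt-build-vocab --tokenizer_config tokenization.yml --save_vocab vocab.txt --size 50000 train.txt.raw
```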
### Convert a SentencePiece vocabulary to OpenNMT-tf
If you trained a SentencePiece model, a vocabulary file was generated in the process. You can convert this vocabulary to work with OpenNMT-tf:

```bash
onmt-build-vocab --from_vocab sp.vocab --from_format sentencepiece --save_vocab vocab.txt
```
### Train a SentencePiece model and vocabulary with OpenNMT-tf
The `onmt-build-vocab` script can also train a new SentencePiece model and vocabulary from raw data. For example, the command:

```bash
onmt-build-vocab --sentencepiece --size 32000 --save_vocab sp train.txt.raw
```

will produce the SentencePiece model `sp.model` and the vocabulary `sp.vocab` of size 32,000. The vocabulary file is saved in the OpenNMT-tf format and can be directly used for training.
Additional SentencePiece training options can be passed to the `--sentencepiece` argument in the format `option=value`, e.g.:

```bash
onmt-build-vocab --sentencepiece character_coverage=0.98 num_threads=4 [...]
```
## Configuring vocabularies
In most cases, you should configure vocabularies with `source_vocabulary` and `target_vocabulary` in the `data` block of the YAML configuration, for example:

```yaml
data:
  source_vocabulary: src_vocab.txt
  target_vocabulary: tgt_vocab.txt
```
However, some models may require a different configuration:
Language models require a single vocabulary:

```yaml
data:
  vocabulary: vocab.txt
```
Parallel inputs require indexed vocabularies:

```yaml
data:
  source_1_vocabulary: src_1_vocab.txt  # Vocabulary of the 1st source input.
  source_2_vocabulary: src_2_vocab.txt  # Vocabulary of the 2nd source input.
```
Nested parallel inputs require an additional level of indexing:

```yaml
data:
  source_1_1_vocabulary: src_1_1_vocab.txt
  source_1_2_vocabulary: src_1_2_vocab.txt
  source_2_vocabulary: src_2_vocab.txt
```
Note: If you train a model with shared embeddings, you should still configure all vocabulary parameters, but in this case they should simply point to the same vocabulary file.
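A minimal sketch of such a configuration, assuming a common vocabulary file named `shared_vocab.txt` (an illustrative name):

```yaml
data:
  source_vocabulary: shared_vocab.txt
  target_vocabulary: shared_vocab.txt
```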