Tiktoken educational BPE trainer takes a long time to train with vocab size 30k #299

Open
sagorbrur opened this issue May 19, 2024 · 2 comments

Comments

@sagorbrur

Hi,
I am trying to train tiktoken's educational BPE trainer on a custom dataset (~15 GB) with a 30k vocab size. It looks like it will take a very long time to finish: a single vocab update took almost 8 hours. Any suggestions for making it faster?
Thanks in advance.

(referring to `def bpe_train(` in `tiktoken/_educational.py`)
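For context, a minimal sketch of how the educational trainer gets invoked (the corpus path is a placeholder and the GPT-2 split pattern is just an example `pat_str`, not necessarily what was used here):

```python
# Minimal sketch of driving the educational trainer; the corpus path is a
# placeholder and the GPT-2 split regex is just one common choice of pat_str.
from tiktoken._educational import SimpleBytePairEncoding

gpt2_pat = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

with open("bengali_corpus.txt", encoding="utf-8") as f:  # placeholder path
    data = f.read()

# SimpleBytePairEncoding.train calls bpe_train under the hood; this simple
# implementation effectively re-scans the corpus for every merge, which is
# why a 15 GB corpus with 30k merges becomes extremely slow.
enc = SimpleBytePairEncoding.train(data, vocab_size=30_000, pat_str=gpt2_pat)
```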

@vladimirzenin

Hi.
I did a small experiment. I have a 41 MB text dataset ("large dataset") collected from various sources: fiction, wiki, blogs, small pieces on different topics. From it I extracted an 854 KB subset ("small dataset") that also spans different topics.

I trained a 6,000-token tokenizer on each of these two datasets. Comparison results:
"Large dataset": trained for 6-8 hours on an old CPU. Let's take this tokenization result as the "standard".
"Small dataset": trained in about 12 minutes. Its result coincides with the "standard" by 61.8%, measured as set1.intersection(set2) over the learned tokens (tokens only, ignoring their indices), as sketched below.

More than half overlap. Not a lot, but not a little either. The conclusion seems obvious: you can train a tokenizer on a small sample of the data. I think the best result will come from drawing the sample so that it covers all the topics present in the large dataset (see the sketch below).
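For example, a minimal sketch of such sampling (sizes in characters, chunk size arbitrary): take evenly spaced chunks from across the large corpus instead of one contiguous slice, so every part of it contributes some text.

```python
# Sketch: draw a small training sample as evenly spaced chunks from the large
# corpus, so that (hopefully) every topic region contributes some text.
def spread_sample(corpus: str, sample_size: int, chunk_size: int = 4096) -> str:
    n_chunks = max(1, sample_size // chunk_size)
    stride = max(chunk_size, len(corpus) // n_chunks)
    pieces = [corpus[i : i + chunk_size] for i in range(0, len(corpus), stride)]
    return "".join(pieces)[:sample_size]

# e.g. shrink a ~41 MB corpus down to roughly 854 KB drawn from all parts of it:
# small_data = spread_sample(large_data, sample_size=854 * 1024)
```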

I am not an expert in this field; my conclusions come from personal attempts to understand how tokenizers work.

@sagorbrur
Author

Hi @vladimirzenin,
Thanks for your input. You're right that a small subset can be enough to learn the tokens. But in our case, Bengali is a diverse language: we set aside ~20 GB of data for tokenizer training so that it captures proper sub-word behaviour, and training on that much seems impractical with this module.
Regarding your conclusion, for Bengali a small slice would not come close to the original word distribution. Still, there may be a more efficient way.
Thanks again.
