Tiktoken educational BPE trainer takes a long time to train with vocab size 30k #299

Open
sagorbrur opened this issue May 19, 2024 · 2 comments

Comments

@sagorbrur

Hi,
I am trying to train tiktoken's educational BPE trainer on a custom dataset (~15 GB) with a 30k vocab size. It looks like it will take a very long time to finish: a single vocab update took almost 8 hours. Any suggestions for making it faster?
Thanks in advance.

(referring to `def bpe_train(` in `tiktoken/_educational.py`)
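For context, a minimal sketch of how the educational trainer gets invoked (the corpus path is a placeholder and the GPT-2 split pattern is just an example `pat_str`, not necessarily what was used here):

```python
# Minimal sketch of driving the educational trainer; the corpus path is a
# placeholder and the GPT-2 split regex is just one common choice of pat_str.
from tiktoken._educational import SimpleBytePairEncoding

gpt2_pat = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

with open("bengali_corpus.txt", encoding="utf-8") as f:  # placeholder path
    data = f.read()

# SimpleBytePairEncoding.train calls bpe_train under the hood; this simple
# implementation effectively re-scans the corpus for every merge, which is
# why a 15 GB corpus with 30k merges becomes extremely slow.
enc = SimpleBytePairEncoding.train(data, vocab_size=30_000, pat_str=gpt2_pat)
```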

@vladimirzenin

Hi.
I did a small experiment. I have a 41 MB text dataset ("large dataset") collected from various sources: fiction, wiki, blogs, small pieces on different topics. From it I extracted an 854 KB subset ("small dataset") that also spans different topics.

I trained a 6,000-token tokenizer on each of these two datasets. Comparison results:
"Large dataset": trained for 6-8 hours on an old CPU. Let's take this tokenization result as the "standard".
"Small dataset": trained in about 12 minutes. Its result coincides with the "standard" by 61.8%, measured as set1.intersection(set2) over the learned tokens (tokens only, ignoring their indices), as sketched below.

More than half overlap. Not a lot, but not a little either. The conclusion seems obvious: you can train a tokenizer on a small sample of the data. I think the best result will come from drawing the sample so that it covers all the topics present in the large dataset (see the sketch below).
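For example, a minimal sketch of such sampling (sizes in characters, chunk size arbitrary): take evenly spaced chunks from across the large corpus instead of one contiguous slice, so every part of it contributes some text.

```python
# Sketch: draw a small training sample as evenly spaced chunks from the large
# corpus, so that (hopefully) every topic region contributes some text.
def spread_sample(corpus: str, sample_size: int, chunk_size: int = 4096) -> str:
    n_chunks = max(1, sample_size // chunk_size)
    stride = max(chunk_size, len(corpus) // n_chunks)
    pieces = [corpus[i : i + chunk_size] for i in range(0, len(corpus), stride)]
    return "".join(pieces)[:sample_size]

# e.g. shrink a ~41 MB corpus down to roughly 854 KB drawn from all parts of it:
# small_data = spread_sample(large_data, sample_size=854 * 1024)
```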

I am not an expert in this field; my conclusions come from personal attempts to understand how tokenizers work.

@sagorbrur
Author

Hi @vladimirzenin,
Thanks for your input. You're right that a small subset can be enough to learn the tokens. But in our case, Bengali is a diverse language: we set aside ~20 GB of data for tokenizer training so that it captures proper sub-word behaviour, and training on that much seems impractical with this module.
Regarding your conclusion, for Bengali a small slice would not come close to the original word distribution. Still, there may be a more efficient way.
Thanks again.
