Tiktoken educational BPE trainer takes long time to train with vocab size 30k #299
Comments
Hi. I trained a tokenizer with a vocabulary of 6,000 tokens on each of these two datasets. Here are the comparison results: more than half of the tokens coincide. Not a lot, but not a little either. The conclusion seems obvious: you can train a tokenizer on a small sample of the data. I think the best result would be achieved if the sample is drawn so that it covers all the topics present in the large dataset. I am not an expert in this field; my conclusions come from my own attempts to understand how tokenizers work. A rough sketch of the sampling idea is shown below.
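(A minimal sketch of that sampling idea; the file path, sample-size budget, and keep probability are illustrative assumptions, not anything from this thread.)

```python
import random

def sample_corpus(path: str, sample_bytes: int = 50 * 1024 * 1024, keep_prob: float = 0.01) -> str:
    """Randomly keep a small fraction of lines from a large corpus until a byte budget is met."""
    chunks, total = [], 0
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            if random.random() < keep_prob:  # spread the sample across the whole file
                chunks.append(line)
                total += len(line.encode("utf-8"))
                if total >= sample_bytes:
                    break
    return "".join(chunks)

# "corpus.txt" is a placeholder for the real large dataset.
training_text = sample_corpus("corpus.txt")
```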
Hi @vladimirzenin,
Hi,
I am trying to train tiktoken on a custom dataset (about 15 GB) with a 30k vocab size, and it looks like it will take a very long time to finish: a single vocab update took almost 8 hours. Any suggestions for making it faster?
Thanks in advance.
tiktoken/tiktoken/_educational.py
Line 117 in c0ba74c
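For context, the trainer referenced above lives in tiktoken's educational module, which is a pure-Python teaching implementation: it re-counts pair frequencies over the corpus for every merge, so its cost grows roughly with corpus size times vocab size. Below is a minimal sketch of training it on a sampled subset instead of the full 15 GB file, assuming the `SimpleBytePairEncoding.train(training_data, vocab_size, pat_str)` API from `_educational.py`; the `sample_corpus` helper is the hypothetical one sketched earlier in the thread, and the regex is the GPT-2 style split pattern.

```python
from tiktoken._educational import SimpleBytePairEncoding

# GPT-2 style split pattern (the same regex tiktoken uses for its gpt2 encoding).
GPT2_PAT = (
    r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+|"""
    r""" ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Train on a small sample (see sample_corpus above) rather than the full corpus.
training_text = sample_corpus("corpus.txt")  # placeholder path
enc = SimpleBytePairEncoding.train(training_text, vocab_size=30_000, pat_str=GPT2_PAT)

tokens = enc.encode("hello world")
```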