Got an empty gpt2-tokenizer while pretraining with THE-PILE dataset #876
Comments
It seems like you’re using the tokenizer incorrectly. You’re specifying a vocab file that corresponds to the GPT-NeoX tokenizer but a merges file that corresponds to the GPT-2 tokenizer. Which one are you trying to use?
I used HFTokenizer to produce the .bin file and the .idx file. But as far as I remember, the vocab file and merges file were downloaded from GPT2_VOCAB_URL = "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json".
I have made some new progress on this issue.
Here is the content of the configuration file:
As Stella mentioned, you should choose either HFTokenizer or the gpt2 tokenizer. Your tokenized data and config should all use the same tokenizer/vocab.
If you want to use gpt2, use those vocab/merges files to tokenize with
If you want to use HFTokenizer like we did for neox-20B, use it to tokenize your data with
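One way to spot a vocab/merges mismatch before tokenizing is to check that every merge rule produces a token that exists in the vocab. The following is only a minimal sketch, not part of gpt-neox: the helper name `missing_merge_tokens` is hypothetical, and it assumes a GPT-2 style pair of files (a flat `vocab.json` mapping token to id, and a `merges.txt` with one space-separated pair per line after an optional `#version` header).

```python
import json
from pathlib import Path


def missing_merge_tokens(vocab_path, merges_path, limit=10):
    """Return up to `limit` merged tokens from merges.txt that are absent from vocab.json.

    Hypothetical helper: assumes a flat GPT-2 style vocab.json (token -> id).
    """
    vocab = json.loads(Path(vocab_path).read_text(encoding="utf-8"))
    missing = []
    for line in Path(merges_path).read_text(encoding="utf-8").split("\n"):
        # Skip blank lines and the "#version: ..." header line.
        if not line or line.startswith("#version"):
            continue
        parts = line.split(" ")
        if len(parts) != 2:
            continue
        # A BPE merge "a b" creates the token "ab", which should be in the vocab.
        merged = "".join(parts)
        if merged not in vocab:
            missing.append(merged)
            if len(missing) >= limit:
                break
    return missing
```

An empty result suggests the two files belong together. If the vocab file is actually a single-file tokenizer JSON of the kind HFTokenizer loads, rather than a flat GPT-2 vocab, most merges would likely be reported as missing, which would point to the mix-up described above.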
Environment:
Description:
I tried to use part of the Pile dataset to pretrain a toy model, following README.md.
The problem occurred when I ran train.py.
The whole process was as follows:
The data was 00.jsonl.zst from the Pile.
Then it generated 4 files:
However, the empty files led to the following error:
Outputs
Traceback (most recent call last):
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/opt/data/private/tmp/gpt-neox/megatron/tokenizer/tokenizer.py", line 41, in build_tokenizer
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Killing subprocess 109450
Killing subprocess 109451
Killing subprocess 109452
Killing subprocess 109453
I wonder why prepare.py gave me 2 empty files, which led to the error.
Could you tell me how to solve the problem?
If you need more information, I will provide it as soon as possible.
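For reference, `json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)` is what Python's `json` module raises when asked to parse an empty string, which fits a 0-byte vocab file. A minimal pre-flight check along these lines could catch that before launching train.py. This is a sketch under that assumption; `check_json_file` is a hypothetical helper, not part of gpt-neox.

```python
import json
import os


def check_json_file(path):
    """Return None if `path` holds parseable JSON, else a short problem description."""
    if not os.path.isfile(path):
        return f"{path}: file not found"
    if os.path.getsize(path) == 0:
        # An empty file is what produces "Expecting value: line 1 column 1 (char 0)".
        return f"{path}: file is empty (0 bytes)"
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
    except json.JSONDecodeError as err:
        return f"{path}: invalid JSON ({err})"
    return None
```

Calling it on the vocab file path from the config and printing any non-`None` result would show whether the file is missing, empty, or malformed before the tokenizer is built.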