How to preprocess data from plaintext #935
Comments
I resolved it.

Apologies for the confusion caused by the README. Could you submit a PR with some corrections?
I want to test gpt-neox to make a tiny GPT model.
I would like to train it on some plaintext files.
I converted them into a JSONL file, but I am stuck at tokenization.
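(For reference, GPT-NeoX's preprocessing script reads JSONL with one JSON object per line and the raw document under a `"text"` key, which I am assuming is the default key it looks for. A minimal sketch of that conversion, with made-up documents and an in-memory buffer standing in for the output file:)

```python
import io
import json

# Illustrative documents; in practice these come from your plaintext files.
docs = ["first plaintext document", "second plaintext document"]

# One JSON object per line, with the document under the "text" key
# (assumed default for preprocess_data).
buf = io.StringIO()  # stands in for an on-disk corpus.jsonl
for doc in docs:
    buf.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")

# Each line parses back on its own, which is what makes JSONL streamable:
buf.seek(0)
records = [json.loads(line) for line in buf]
```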
I don't know how to do that. I have only plaintext, and I want to build the vocab and merges files from it myself to pass to preprocess_data. My data is non-English, so I want to create these files myself rather than download the existing English vocab/merges files.
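(Since the goal is a custom vocab/merges pair, it may help to see what a merges file actually records. Below is a toy, pure-Python sketch of the BPE merge-learning loop — the corpus and frequencies are made up, and in practice you would train with a tokenizer library such as HuggingFace `tokenizers`, which can write vocab.json and merges.txt directly:)

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict.

    Each word is a tuple of symbols; per iteration, the most frequent
    adjacent pair is merged into one symbol. Each learned pair is what
    one line of a merges.txt file records.
    """
    vocab = {tuple(w): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = count
        vocab = merged
    return merges

# Toy corpus: word -> frequency (illustrative only).
print(learn_merges({"low": 5, "lower": 2, "lowest": 2}, 3))
```

(The learned merges here come out as `l+o`, then `lo+w`, then `low+e` — exactly the kind of ranked pair list a merges file stores, independent of the language of the corpus.)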
I used the same plaintext files to train a small GPT model with nanoGPT, but I want to do the same in NeoX, both because nanoGPT does not support any other format and because I want to fine-tune other models in the future (and to try out the GPT-NeoX library).