How to preprocess data from plaintext #935

Closed
Maniues opened this issue May 13, 2023 · 2 comments

Comments

Maniues commented May 13, 2023

I want to test gpt-neox to make a tiny GPT model.
I would like to train it on some plaintext files.
I converted them into a JSONL file and... I got stuck on tokenization.
I don't know how to do that. I have only plaintext, and I want to build the vocab and merge files from it so I can pass them to preprocess_data. My data is non-English, so I want to create these files myself rather than download the existing English vocab/merge files.

I used the same plaintext files to make a small GPT model with nanoGPT, but I want to do the same in NeoX, because nanoGPT does not support any other format and because I want to fine-tune some other models in the future (and to test the GPT-NeoX library).

Author

Maniues commented May 14, 2023

I resolved it.

  1. There are some mistakes in the README. For example, the 'data-path' value is invalid: it should be *mydataset_text_document, not just *mydataset.
  2. There are no scripts for building the tokenizer files, so I wrote some to make vocab.json and merges.txt from a single txt file, and I used a converter to JSON lines.
  3. I ran preprocess_data with only a few parameters: output-prefix, input, tokenizer-type, and the vocab/merge file paths.
  4. There was an issue with Megatron.data, so I changed python3-config to python3.9-config or python3.10-config in the Makefile and recompiled using make.
  5. Finally, my graphics card does not have enough memory to run NeoX (or I don't know how to configure it), so I moved my code to Colab.
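
For anyone following the same route, here is a minimal sketch of the plaintext-to-JSON-lines step from item 2 and the data-path naming from item 1. It is not the script used above. The function names `txt_to_jsonl` and `data_path_for` are hypothetical, the blank-line document separator is an assumption about the input, and the `"text"` key is assumed to be the JSONL key that preprocess_data reads.

```python
import json
from pathlib import Path


def txt_to_jsonl(txt_path, jsonl_path, key="text"):
    """Write one JSON object per document found in a plaintext file.

    Assumption: documents are separated by one or more blank lines.
    Returns the number of documents written.
    """
    raw = Path(txt_path).read_text(encoding="utf-8")
    docs = [block.strip() for block in raw.split("\n\n") if block.strip()]
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for doc in docs:
            # ensure_ascii=False keeps non-English characters readable
            # instead of escaping them as \uXXXX sequences.
            out.write(json.dumps({key: doc}, ensure_ascii=False) + "\n")
    return len(docs)


def data_path_for(output_prefix, key="text"):
    """Build the dataset prefix in the form reported in item 1.

    Assumption: the suffix is "_<key>_document", which matches the
    *mydataset_text_document name when the JSONL key is "text".
    """
    return f"{output_prefix}_{key}_document"
```

With an output-prefix of `data/mydataset`, `data_path_for("data/mydataset")` gives `data/mydataset_text_document`, which is the form that item 1 says 'data-path' needs.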

Maniues closed this as completed May 14, 2023
@StellaAthena
Member

Apologies for the confusion caused by the README. Could you submit a PR with some corrections?


2 participants