How to preprocess data from plaintext #935

Maniues · 2023-05-13T14:16:22Z

I want to try gpt-neox by training a tiny GPT model.
I would like to train it on some plaintext files.
I converted them into a JSONL file, but I am stuck at tokenization.
I don't know how to do that. I have only plaintext, and I want to build vocab and merge files from it to pass to preprocess_data. My data is non-English, so I want to create these files myself rather than download the existing English vocab/merge files.

I already used the same plaintext files to train a small GPT model with nanoGPT, but I want to do the same in NeoX, because nanoGPT does not support any other format and because I want to fine-tune other models in the future (and test the GPT-NeoX library).

Maniues · 2023-05-14T05:42:24Z

I resolved it.

There are some mistakes in the README. For example, the 'data-path' shown there is invalid: it should be *mydataset_text_document, not just *mydataset.
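
The naming behind that correction can be sketched as follows. This is only an illustration, assuming preprocess_data names its outputs `<output-prefix>_<json-key>_document` with `.bin` and `.idx` files and that the JSONL key is `text`. The helper names `data_path_for` and `missing_files` are hypothetical and not part of GPT-NeoX.

```python
from pathlib import Path


def data_path_for(output_prefix, json_key="text"):
    """Return the value to use for "data-path" (hypothetical helper)."""
    # Assumption: outputs are named <prefix>_<key>_document.bin / .idx,
    # and "data-path" takes that name without the extension.
    return f"{output_prefix}_{json_key}_document"


def missing_files(data_path):
    """List the expected .bin/.idx extensions whose files do not exist."""
    return [ext for ext in (".bin", ".idx") if not Path(data_path + ext).exists()]
```

For example, `data_path_for("data/mydataset")` gives `data/mydataset_text_document`, and `missing_files(...)` on that value is a quick check that preprocessing actually produced both files.
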
There are no scripts for training a tokenizer, so I wrote some to build vocab.json and merges.txt from a single txt file, and I used a converter to produce JSON lines.
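
The scripts themselves are not included in this thread. A rough sketch of what such scripts might look like is below. It assumes the Hugging Face `tokenizers` package for byte-level BPE training, blank-line-separated paragraphs as documents, and a `text` JSON key; the function names, the vocabulary size, and the `<|endoftext|>` special token are illustrative choices, not the author's actual code.

```python
import json
from pathlib import Path


def text_to_jsonl(txt_path, jsonl_path):
    """Write one JSON object per non-empty paragraph under the "text" key."""
    raw = Path(txt_path).read_text(encoding="utf-8")
    # Assumption: blank lines separate documents; adapt to your corpus.
    docs = [d.strip() for d in raw.split("\n\n") if d.strip()]
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for doc in docs:
            # ensure_ascii=False keeps non-English characters readable.
            out.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
    return len(docs)


def train_bpe(txt_path, out_dir, vocab_size=8000):
    """Train a byte-level BPE tokenizer; save vocab.json and merges.txt."""
    # Imported lazily so the converter above works without the package.
    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=[str(txt_path)],
        vocab_size=vocab_size,  # hypothetical size for a tiny model
        min_frequency=2,
        special_tokens=["<|endoftext|>"],
    )
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # save_model writes vocab.json and merges.txt into out_dir.
    return tokenizer.save_model(str(out_dir))
```

Calling `text_to_jsonl("corpus.txt", "corpus.jsonl")` and then `train_bpe("corpus.txt", "tokenizer")` would produce the three files mentioned above; the file names here are placeholders.
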
I ran preprocess_data with only a few parameters: output-prefix, input, tokenizer-type, and the paths to the vocab and merge files.
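
A minimal sketch of assembling that call is shown here. The `--input`, `--output-prefix`, and `--tokenizer-type` names come from the description above; the script location `tools/preprocess_data.py`, the flag spellings `--vocab-file` and `--merge-file`, and the value `GPT2BPETokenizer` are assumptions that may differ between GPT-NeoX versions, so check the script's `--help` output before running it.

```python
import shlex


def preprocess_command(jsonl_path, output_prefix, vocab_path, merges_path):
    """Build the argument list for preprocess_data (hypothetical helper)."""
    return [
        "python", "tools/preprocess_data.py",  # assumed script location
        "--input", jsonl_path,
        "--output-prefix", output_prefix,
        "--tokenizer-type", "GPT2BPETokenizer",  # assumed tokenizer type
        "--vocab-file", vocab_path,    # assumed flag spelling
        "--merge-file", merges_path,   # assumed flag spelling
    ]


def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

The list form can be passed to `subprocess.run(cmd, check=True)` from the repository root, or printed with `as_shell` and pasted into a terminal.
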
There was an issue with Megatron.data, so I changed python3-config to python3.9-config or python3.10-config in the Makefile and recompiled with make.
Finally, my graphics card does not have enough memory to run NeoX (or I don't know how to configure it), so I moved my code to Colab.

StellaAthena · 2023-05-14T12:16:52Z

Apologies for the confusion caused by the README. Could you submit a PR with some corrections?

Maniues closed this as completed May 14, 2023
