Error when preprocessing and loading data using "lazy" mode #904
Comments
Why do you think this has to do with lazy mode? Have you verified that the checkpoints specified are in fact where you've pointed the program?
Thank you for your reply!
I see. I misread your error code at first, let me look into it. @haileyschoelkopf it looks like the core issue here is a failed assertion about data types for the index mappings… could this have snuck in when you were dealing with the dataset size overflowing?
Might be. After looking through @Quentin-Anthony's overflow fix, it seems that even when
I found a related Megatron-LM issue: NVIDIA/Megatron-LM#170
The concatenation functionality they describe as a replacement is also supported in our code, so I’m going to make a minor PR to remove the "lazy" option in preprocessing and close this as completed.
Describe the bug
I preprocessed the enwik8 data in "lazy" mode and got the .bin and .idx files, but pretraining with these files fails with an error.
To Reproduce
Steps to reproduce the behavior:
python tools/preprocess_data.py \
    --input ./data/enwik8/enwik8.zip \
    --output-prefix ./data/enwik8/enwik8 \
    --vocab ./data/gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --dataset-impl lazy \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod
In local_setup.yml, set "data-path": "data/enwik8/enwik8_text_document"
In 125M.yml, set "data-impl": "lazy"
Run pretraining.
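Before running pretraining, it may help to confirm that the preprocessing step actually produced both files for the prefix given in "data-path". The following is a minimal sketch, not part of the repository: the helper name missing_dataset_files is hypothetical, and it assumes the dataset files are named by appending .bin and .idx to the "data-path" prefix.

```python
from __future__ import annotations

from pathlib import Path


def missing_dataset_files(data_path: str) -> list[str]:
    """Return the expected files for a data-path prefix that do not exist.

    Assumption: the files are named <data_path>.bin and <data_path>.idx.
    This helper is hypothetical and is not part of the repository.
    """
    expected = [Path(data_path + suffix) for suffix in (".bin", ".idx")]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Prefix taken from the "data-path" setting in the steps above.
    missing = missing_dataset_files("data/enwik8/enwik8_text_document")
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("Both .bin and .idx files are present.")
```

This only checks that the files are present. It does not validate their contents, so a data type mismatch in the index would still surface at load time.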
Expected behavior
Pretraining runs without errors.
Proposed solution
None. I don't know what causes the error.
Screenshots
Loading checkpoint and starting from iteration 0
[2023-04-25 16:08:08,311] [WARNING] [engine.py:2769:load_checkpoint] Unable to find latest file at checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.