
when preprocess data and load data using "lazy" mode #904

Closed
peiyingxin opened this issue Apr 25, 2023 · 6 comments · Fixed by #1033
Labels: bug (Something isn't working)

Comments

@peiyingxin

Describe the bug
Using "lazy" mode to process enwik8 data, when I got .bin and .idx file, using these file to pretrain, but I got error.

To Reproduce
Steps to reproduce the behavior:

  1. python tools/preprocess_data.py \
       --input ./data/enwik8/enwik8.zip \
       --output-prefix ./data/enwik8/enwik8 \
       --vocab ./data/gpt2-vocab.json \
       --merge-file gpt2-merges.txt \
       --dataset-impl lazy \
       --tokenizer-type GPT2BPETokenizer \
       --append-eod

  2. modify local_setup.yml: "data-path": "data/enwik8/enwik8_text_document"

  3. modify 125M.yml: "data-impl": "lazy" (both edits are sketched below)

  4. run pretrain
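
For reference, the two config edits in steps 2 and 3 look like this (a sketch; key names as used in the gpt-neox YAML configs, paths from my setup):

```yaml
# local_setup.yml
"data-path": "data/enwik8/enwik8_text_document"

# 125M.yml
"data-impl": "lazy"
```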

Expected behavior
Pretraining runs normally, as it does with the mmap implementation.

Proposed solution
None yet; I don't know what causes the assertion failure.

Screenshots
Loading checkpoint and starting from iteration 0
[2023-04-25 16:08:08,311] [WARNING] [engine.py:2769:load_checkpoint] Unable to find latest file at checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.

building train, validation, and test datasets ...
dataset split:
train:
document indices in [0, 29311866) total of 29311866 documents
validation:
document indices in [29311866, 30219354) total of 907488 documents
test:
document indices in [30219354, 30249604) total of 30250 documents
WARNING: could not find index map files, building the indices on rank 0 ...
elapsed time to build and save doc-idx mapping (seconds): 11.041577
Traceback (most recent call last):
File "/mnt/home/gpt-neox/train.py", line 27, in
pretrain(neox_args=neox_args)
File "/mnt/home/gpt-neox/megatron/training.py", line 203, in pretrain
) = build_train_valid_test_data_iterators(neox_args=neox_args)
File "/mnt/home/gpt-neox/megatron/data/data_utils.py", line 400, in build_train_valid_test_data_iterators
train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
File "/mnt/home/gpt-neox/megatron/data/data_utils.py", line 139, in build_train_valid_test_datasets
train_dataset = build_dataset(0, "train")
File "/mnt/home/gpt-neox/megatron/data/data_utils.py", line 127, in build_dataset
dataset = GPT2Dataset(
File "/mnt/home/gpt-neox/megatron/data/gpt2_dataset.py", line 52, in init
self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
File "/mnt/home/gpt-neox/megatron/data/gpt2_dataset.py", line 176, in _build_index_mappings
assert sizes.dtype == np.int32
AssertionError
[2023-04-25 16:08:21,601] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3065
[2023-04-25 16:08:21,602] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3066
[2023-04-25 16:08:21,816] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3.10', '-u', 'train.py', '--local_rank=1', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogOCwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDQsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk1XSwgImVwcyI6IDFlLTA4fX0sICJmcDE2IjogeyJlbmFibGVkIjogdHJ1ZSwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiZ3JhZGllbnRfY2xpcHBpbmciOiAxLjAsICJ6ZXJvX29wdGltaXphdGlvbiI6IHsic3RhZ2UiOiAxLCAiYWxsZ2F0aGVyX3BhcnRpdGlvbnMiOiB0cnVlLCAiYWxsZ2F0aGVyX2J1Y2tldF9zaXplIjogNTAwMDAwMDAwLCAib3ZlcmxhcF9jb21tIjogdHJ1ZSwgInJlZHVjZV9zY2F0dGVyIjogdHJ1ZSwgInJlZHVjZV9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgImNvbnRpZ3VvdXNfZ3JhZGllbnRzIjogdHJ1ZX0sICJ3YWxsX2Nsb2NrX2JyZWFrZG93biI6IHRydWV9', '--megatron_config', '/mnt/home/gpt-neox/megatron_config.json'] exits with return code = 1

Environment (please complete the following information):

  • GPUs: 2
  • Configs:
@peiyingxin peiyingxin added the bug Something isn't working label Apr 25, 2023
@StellaAthena
Member

Why do you think this has to do with lazy mode? Have you verified that the checkpoints specified are in fact where you've pointed the program?

@peiyingxin
Author

Why do you think this has to do with lazy mode? Have you verified that the checkpoints specified are in fact where you've pointed the program?

Thank you for your reply!
I have used mmap mode to preprocess and load the enwik8 data, and the pretraining process ran fine.
When I use lazy mode to preprocess and load the enwik8 data, I get this error; there is no other change.

@StellaAthena
Member

StellaAthena commented Apr 26, 2023

I see. I misread your error code at first, let me look into it.

@haileyschoelkopf it looks like the core issue here is a failed assertion about data types for the index mappings… could this have snuck in when you were dealing with the dataset size overflowing?

File "/mnt/home/gpt-neox/megatron/data/gpt2_dataset.py", line 176, in _build_index_mappings assert sizes.dtype == np.int32 AssertionError

@haileyschoelkopf
Contributor

Might be -- after looking through @Quentin-Anthony's overflow fix, it seems that even when sample_idx is built in int64, the dtype of sizes should still be np.int32. I've never used the "lazy" impl, so it could be that this has been broken for longer; I'd need to look into it by running it locally.
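
If so, one possible explanation (unverified, based on the fairseq-lineage readers): the legacy lazy/cached IndexedDataset deserializes its sizes array as 64-bit integers, while MMapIndexedDataset stores it as int32, so the assertion would only ever hold for mmap. A rough sketch of the suspected mismatch:

```python
import numpy as np

# Hypothetical per-document size arrays, one per reader family.
sizes_mmap = np.array([12, 7, 30], dtype=np.int32)  # mmap reader: int32
sizes_lazy = np.array([12, 7, 30], dtype=np.int64)  # lazy reader: int64

for name, sizes in [("mmap", sizes_mmap), ("lazy", sizes_lazy)]:
    # The check in _build_index_mappings passes only for the int32 case.
    print(name, sizes.dtype, sizes.dtype == np.int32)
```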

@peiyingxin
Author

I found a Megatron-LM issue: NVIDIA/Megatron-LM#170
It seems that Megatron-LM no longer supports the lazy dataloader? Have you ever used lazy mode?

@StellaAthena
Member

The concatenation functionality they describe as a replacement is also supported in our code, so I’m going to make a minor PR to remove the "lazy" option in preprocessing and close this as completed.
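
For anyone hitting this in the meantime, the same preprocessing run works with the mmap implementation (as the reporter confirmed above); only the --dataset-impl flag and the matching "data-impl" config value change:

```bash
python tools/preprocess_data.py \
  --input ./data/enwik8/enwik8.zip \
  --output-prefix ./data/enwik8/enwik8 \
  --vocab ./data/gpt2-vocab.json \
  --merge-file gpt2-merges.txt \
  --dataset-impl mmap \
  --tokenizer-type GPT2BPETokenizer \
  --append-eod
```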

@StellaAthena StellaAthena self-assigned this Jun 3, 2023
@dashstander dashstander self-assigned this Sep 15, 2023
@dashstander dashstander linked a pull request Sep 15, 2023 that will close this issue