
when preprocess data and load data using "lazy" mode #904

Closed
peiyingxin opened this issue Apr 25, 2023 · 6 comments · Fixed by #1033
Labels: bug (Something isn't working)

Comments

@peiyingxin

Describe the bug
Using "lazy" mode to process enwik8 data, when I got .bin and .idx file, using these file to pretrain, but I got error.

To Reproduce
Steps to reproduce the behavior:

  1. python tools/preprocess_data.py \
       --input ./data/enwik8/enwik8.zip \
       --output-prefix ./data/enwik8/enwik8 \
       --vocab ./data/gpt2-vocab.json \
       --merge-file gpt2-merges.txt \
       --dataset-impl lazy \
       --tokenizer-type GPT2BPETokenizer \
       --append-eod

  2. modify local_setup.yml: "data-path": "data/enwik8/enwik8_text_document"

  3. modify 125M.yml: "data-impl": "lazy" (both edits are sketched below)

  4. run pretrain
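
For reference, the two config edits in steps 2 and 3 look like this (a sketch; key names as used in the gpt-neox YAML configs, paths from my setup):

```yaml
# local_setup.yml
"data-path": "data/enwik8/enwik8_text_document"

# 125M.yml
"data-impl": "lazy"
```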

Expected behavior
Pretraining runs normally, as it does with the mmap implementation.

Proposed solution
None yet; I don't know what causes the assertion failure.

Screenshots
Loading checkpoint and starting from iteration 0
[2023-04-25 16:08:08,311] [WARNING] [engine.py:2769:load_checkpoint] Unable to find latest file at checkpoints/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint.

building train, validation, and test datasets ...
dataset split:
train:
document indices in [0, 29311866) total of 29311866 documents
validation:
document indices in [29311866, 30219354) total of 907488 documents
test:
document indices in [30219354, 30249604) total of 30250 documents
WARNING: could not find index map files, building the indices on rank 0 ...
elapsed time to build and save doc-idx mapping (seconds): 11.041577
Traceback (most recent call last):
File "/mnt/home/gpt-neox/train.py", line 27, in
pretrain(neox_args=neox_args)
File "/mnt/home/gpt-neox/megatron/training.py", line 203, in pretrain
) = build_train_valid_test_data_iterators(neox_args=neox_args)
File "/mnt/home/gpt-neox/megatron/data/data_utils.py", line 400, in build_train_valid_test_data_iterators
train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
File "/mnt/home/gpt-neox/megatron/data/data_utils.py", line 139, in build_train_valid_test_datasets
train_dataset = build_dataset(0, "train")
File "/mnt/home/gpt-neox/megatron/data/data_utils.py", line 127, in build_dataset
dataset = GPT2Dataset(
File "/mnt/home/gpt-neox/megatron/data/gpt2_dataset.py", line 52, in init
self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
File "/mnt/home/gpt-neox/megatron/data/gpt2_dataset.py", line 176, in _build_index_mappings
assert sizes.dtype == np.int32
AssertionError
[2023-04-25 16:08:21,601] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3065
[2023-04-25 16:08:21,602] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3066
[2023-04-25 16:08:21,816] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3.10', '-u', 'train.py', '--local_rank=1', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogOCwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDQsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk1XSwgImVwcyI6IDFlLTA4fX0sICJmcDE2IjogeyJlbmFibGVkIjogdHJ1ZSwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiZ3JhZGllbnRfY2xpcHBpbmciOiAxLjAsICJ6ZXJvX29wdGltaXphdGlvbiI6IHsic3RhZ2UiOiAxLCAiYWxsZ2F0aGVyX3BhcnRpdGlvbnMiOiB0cnVlLCAiYWxsZ2F0aGVyX2J1Y2tldF9zaXplIjogNTAwMDAwMDAwLCAib3ZlcmxhcF9jb21tIjogdHJ1ZSwgInJlZHVjZV9zY2F0dGVyIjogdHJ1ZSwgInJlZHVjZV9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgImNvbnRpZ3VvdXNfZ3JhZGllbnRzIjogdHJ1ZX0sICJ3YWxsX2Nsb2NrX2JyZWFrZG93biI6IHRydWV9', '--megatron_config', '/mnt/home/gpt-neox/megatron_config.json'] exits with return code = 1

Environment (please complete the following information):

  • GPUs: 2
  • Configs:
@peiyingxin peiyingxin added the bug Something isn't working label Apr 25, 2023
@StellaAthena
Member

Why do you think this has to do with lazy mode? Have you verified that the checkpoints specified are in fact where you've pointed the program?

@peiyingxin
Author

Why do you think this has to do with lazy mode? Have you verified that the checkpoints specified are in fact where you've pointed the program?

Thank you for your reply!
I have used mmap mode to preprocess and load the enwik8 data, and the pretraining process ran fine.
When I use lazy mode to preprocess and load the enwik8 data, I get this error; there is no other change.

@StellaAthena
Member

StellaAthena commented Apr 26, 2023

I see. I misread your error code at first, let me look into it.

@haileyschoelkopf it looks like the core issue here is a failed assertion about data types for the index mappings… could this have snuck in when you were dealing with the dataset size overflowing?

File "/mnt/home/gpt-neox/megatron/data/gpt2_dataset.py", line 176, in _build_index_mappings assert sizes.dtype == np.int32 AssertionError

@haileyschoelkopf
Contributor

Might be -- after looking through @Quentin-Anthony's overflow fix, it seems that even when sample_idx is built in int64, the dtype of sizes should still be np.int32. I've never used the "lazy" impl, so it could be that this has been broken for longer; I'd need to look into it by running it locally.
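
If so, one possible explanation (unverified, based on the fairseq-lineage readers): the legacy lazy/cached IndexedDataset deserializes its sizes array as 64-bit integers, while MMapIndexedDataset stores it as int32, so the assertion would only ever hold for mmap. A rough sketch of the suspected mismatch:

```python
import numpy as np

# Hypothetical per-document size arrays, one per reader family.
sizes_mmap = np.array([12, 7, 30], dtype=np.int32)  # mmap reader: int32
sizes_lazy = np.array([12, 7, 30], dtype=np.int64)  # lazy reader: int64

for name, sizes in [("mmap", sizes_mmap), ("lazy", sizes_lazy)]:
    # The check in _build_index_mappings passes only for the int32 case.
    print(name, sizes.dtype, sizes.dtype == np.int32)
```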

@peiyingxin
Author

I found a Megatron-LM issue: NVIDIA/Megatron-LM#170
It seems that Megatron-LM no longer supports the lazy dataloader? Have you ever used lazy mode?

@StellaAthena
Member

The concatenation functionality they describe as a replacement is also supported in our code, so I’m going to make a minor PR to remove the "lazy" option in preprocessing and close this as completed.
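
For anyone hitting this in the meantime, the same preprocessing run works with the mmap implementation (as the reporter confirmed above); only the --dataset-impl flag and the matching "data-impl" config value change:

```bash
python tools/preprocess_data.py \
  --input ./data/enwik8/enwik8.zip \
  --output-prefix ./data/enwik8/enwik8 \
  --vocab ./data/gpt2-vocab.json \
  --merge-file gpt2-merges.txt \
  --dataset-impl mmap \
  --tokenizer-type GPT2BPETokenizer \
  --append-eod
```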

@StellaAthena StellaAthena self-assigned this Jun 3, 2023
@dashstander dashstander self-assigned this Sep 15, 2023
@dashstander dashstander linked a pull request Sep 15, 2023 that will close this issue