when training with multi node, raise filenotfound error #919

Closed
cateto opened this issue May 4, 2023 · 2 comments
Labels
bug Something isn't working

Comments


cateto commented May 4, 2023

Describe the bug

> building train, validation, and test datasets ...
ML-01:     reading sizes...
ML-01:     reading pointers...
ML-01:     reading document index...
ML-01:     creating numpy buffer of mmap...
ML-01:     creating memory view of numpy buffer...
ML-01:  > dataset split:
ML-01:     train:
ML-01:      document indices in [0, 2788588) total of 2788588 documents
ML-01:     validation:
ML-01:      document indices in [2788588, 2874922) total of 86334 documents
ML-01:     test:
ML-01:      document indices in [2874922, 2877800) total of 2878 documents
ML-01: use_shared_fs===================================================
ML-01: True
ML-01: should_process_dataset===================================================
ML-01: True
ML-01:  > WARNING: could not find index map files, building the indices on rank 0 ...
ML-01:  > elapsed time to build and save doc-idx mapping (seconds): 840.129082
ML-01:     using:
ML-01:      number of documents:       2788588
ML-01:      number of epochs:          3613
ML-01:      sequence length:           2048
ML-01:      total number of samples:   576123680
ML-01:  > elapsed time to build and save sample-idx mapping (seconds): 39.764219
ML-01:  > elapsed time to build and save shuffle-idx mapping (seconds): 27.045912
ML-02: Traceback (most recent call last):
ML-02:   File "train.py", line 27, in <module>
ML-02:     pretrain(neox_args=neox_args)
ML-02:   File "/home/mlusers/gpt-neox/megatron/training.py", line 203, in pretrain
ML-02:     ) = build_train_valid_test_data_iterators(neox_args=neox_args)
ML-02:   File "/home/mlusers/gpt-neox/megatron/data/data_utils.py", line 400, in build_train_valid_test_data_iterators
ML-02:     train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
ML-02:   File "/home/mlusers/gpt-neox/megatron/data/data_utils.py", line 139, in build_train_valid_test_datasets
ML-02:     train_dataset = build_dataset(0, "train")
ML-02:   File "/home/mlusers/gpt-neox/megatron/data/data_utils.py", line 127, in build_dataset
ML-02:     dataset = GPT2Dataset(
ML-02:   File "/home/mlusers/gpt-neox/megatron/data/gpt2_dataset.py", line 52, in __init__
ML-02:     self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
ML-02:   File "/home/mlusers/gpt-neox/megatron/data/gpt2_dataset.py", line 215, in _build_index_mappings
ML-02:     doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode="r")
ML-02:   File "/home/mlusers/.conda/envs/gpt_neox_py38/lib/python3.8/site-packages/numpy/lib/npyio.py", line 405, in load
ML-02:     fid = stack.enter_context(open(os_fspath(file), "rb"))
ML-02: FileNotFoundError: [Errno 2] No such file or directory: 'data/wiki/wiki_text_document_train_indexmap_576000000ns_2048sl_1234s_doc_idx.npy'

Environment (please complete the following information):

  • GPUs: 4 GPUs, A100
  • Configs: multi-node

Additional context

In multi-node training, the train and validation steps finish,
but the test data split doesn't work.
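
To confirm which node cannot see the index map files, a check like the following can be run on every node. This is a minimal sketch, not GPT-NeoX code: the naming pattern is inferred from the path in the FileNotFoundError above, the sample_idx and shuffle_idx suffixes are assumptions based on the log lines, and expected_index_files and missing_index_files are hypothetical helper names.

```python
import os
import socket


def expected_index_files(data_prefix, name, num_samples, seq_length, seed):
    """Return the index map paths implied by the filename in the traceback.

    Assumption: the sample_idx and shuffle_idx files follow the same pattern
    as the doc_idx file named in the error message.
    """
    base = f"{data_prefix}_{name}_indexmap_{num_samples}ns_{seq_length}sl_{seed}s"
    return [f"{base}_{kind}.npy" for kind in ("doc_idx", "sample_idx", "shuffle_idx")]


def missing_index_files(paths):
    """Return the subset of paths that do not exist on this node."""
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    # Values taken from the path in the error message above.
    paths = expected_index_files(
        "data/wiki/wiki_text_document", "train", 576000000, 2048, 1234
    )
    missing = missing_index_files(paths)
    print(f"{socket.gethostname()}: {len(missing)} of {len(paths)} index files missing")
    for p in missing:
        print("  missing:", p)
```

If the files exist on ML-01 but are reported missing on ML-02, the data directory is probably not shared between the nodes.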

cateto added the bug label May 4, 2023
cateto changed the title from "when training with multi node" to "when training with multi node, raise data split error" May 4, 2023
cateto changed the title from "when training with multi node, raise data split error" to "when training with multi node, raise filenotfound error" May 4, 2023

cateto commented May 9, 2023

I solved it.
Should I copy megatron_config.json and the training data to the other node?
I think shared memory doesn't work.
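
The log prints use_shared_fs and should_process_dataset, which suggests the choice of which rank builds the index files depends on whether the nodes share a filesystem. The sketch below illustrates one plausible form of that decision. It is an assumption about the logic, not the actual GPT-NeoX implementation, and the builders helper and the 2 nodes by 2 GPUs layout are hypothetical.

```python
def should_process_dataset(global_rank: int, local_rank: int, use_shared_fs: bool) -> bool:
    """Decide whether this process should build the index map files.

    Assumed rule: with a shared filesystem, only global rank 0 builds them;
    without one, local rank 0 on every node builds its own copy.
    """
    if use_shared_fs:
        return global_rank == 0
    return local_rank == 0


def builders(num_nodes: int, gpus_per_node: int, use_shared_fs: bool):
    """List the (node, local_rank) pairs that would build the index files."""
    result = []
    for node in range(num_nodes):
        for local_rank in range(gpus_per_node):
            global_rank = node * gpus_per_node + local_rank
            if should_process_dataset(global_rank, local_rank, use_shared_fs):
                result.append((node, local_rank))
    return result


if __name__ == "__main__":
    # Hypothetical layout: 2 nodes with 2 GPUs each.
    print(builders(2, 2, use_shared_fs=True))   # [(0, 0)]
    print(builders(2, 2, use_shared_fs=False))  # [(0, 0), (1, 0)]
```

Under this rule, running with use_shared_fs set to True on nodes that do not actually share the data directory leaves the second node without index files, which would match the FileNotFoundError on ML-02.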

@StellaAthena
Member

Closing as a duplicate of #925
