Training gpt stuck at the beginning #988

Closed
jiezhangGt opened this issue Jul 5, 2023 · 3 comments
Labels: feature request (New feature or request)

Comments

jiezhangGt commented Jul 5, 2023

I followed the README and started training the 20B model from scratch. The log stops right after "training ..." and makes no further progress, so training seems to be stuck:

gpu250: Time to load utils op: 0.003278970718383789 seconds
gpu250: [2023-07-05 15:08:29,918] [INFO] [engine.py:83:init] CONFIG: micro_batches=6 micro_batch_size=1
gpu250: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=0 STAGE=0 LAYERS=13 [0, 13) STAGE_PARAMS=2646199296 (2646.199M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu250: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=1 STAGE=0 LAYERS=13 [0, 13) STAGE_PARAMS=2646199296 (2646.199M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu250: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=5 STAGE=1 LAYERS=12 [13, 25) STAGE_PARAMS=2718609408 (2718.609M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu250: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=4 STAGE=1 LAYERS=12 [13, 25) STAGE_PARAMS=2718609408 (2718.609M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu273: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=8 STAGE=2 LAYERS=12 [25, 37) STAGE_PARAMS=2718609408 (2718.609M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu273: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=9 STAGE=2 LAYERS=12 [25, 37) STAGE_PARAMS=2718609408 (2718.609M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu273: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=13 STAGE=3 LAYERS=12 [37, 49) STAGE_PARAMS=2193110016 (2193.110M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu273: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=12 STAGE=3 LAYERS=12 [37, 49) STAGE_PARAMS=2193110016 (2193.110M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu250: > number of parameters on model parallel rank 0: 2646199296
gpu250: > number of parameters on model parallel rank 1: 2646199296
gpu273: > number of parameters on model parallel rank 0: 2193110016
gpu273: > number of parameters on model parallel rank 1: 2193110016
gpu250: > number of parameters on model parallel rank 0: 2718609408
gpu273: > number of parameters on model parallel rank 0: 2718609408
gpu250: > number of parameters on model parallel rank 1: 2718609408
gpu273: > number of parameters on model parallel rank 1: 2718609408
gpu250: > total params: 20,553,056,256
gpu250: > building train, validation, and test datasets ...
gpu250: reading sizes...
gpu250: reading pointers...
gpu250: reading document index...
gpu250: creating numpy buffer of mmap...
gpu250: creating memory view of numpy buffer...
gpu250: > dataset split:
gpu250: train:
gpu250: document indices in [0, 157218) total of 157218 documents
gpu250: validation:
gpu250: document indices in [157218, 157850) total of 632 documents
gpu250: test:
gpu250: document indices in [157850, 158008) total of 158 documents
gpu250: > WARNING: could not find index map files, building the indices on rank 0 ...
gpu250: > elapsed time to build and save doc-idx mapping (seconds): 0.101815
gpu250: using:
gpu250: number of documents: 157218
gpu250: number of epochs: 11
gpu250: sequence length: 2048
gpu250: total number of samples: 1808420
gpu250: > elapsed time to build and save sample-idx mapping (seconds): 0.053199
gpu250: > elapsed time to build and save shuffle-idx mapping (seconds): 0.078683
gpu250: > loading doc-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_train_indexmap_1800000ns_2048sl_1234s_doc_idx.npy
gpu250: > loading sample-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_train_indexmap_1800000ns_2048sl_1234s_sample_idx.npy
gpu250: > loading shuffle-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_train_indexmap_1800000ns_2048sl_1234s_shuffle_idx.npy
gpu250: loaded indexed file in 0.009 seconds
gpu250: total number of samples: 1808421
gpu250: total number of epochs: 11
gpu250: > WARNING: could not find index map files, building the indices on rank 0 ...
gpu250: > elapsed time to build and save doc-idx mapping (seconds): 0.005961
gpu250: using:
gpu250: number of documents: 632
gpu250: number of epochs: 33
gpu250: sequence length: 2048
gpu250: total number of samples: 18187
gpu250: > elapsed time to build and save sample-idx mapping (seconds): 0.004232
gpu250: > elapsed time to build and save shuffle-idx mapping (seconds): 0.005176
gpu250: > loading doc-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_valid_indexmap_18120ns_2048sl_1234s_doc_idx.npy
gpu250: > loading sample-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_valid_indexmap_18120ns_2048sl_1234s_sample_idx.npy
gpu250: > loading shuffle-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_valid_indexmap_18120ns_2048sl_1234s_shuffle_idx.npy
gpu250: loaded indexed file in 0.013 seconds
gpu250: total number of samples: 18188
gpu250: total number of epochs: 33
gpu250: > WARNING: could not find index map files, building the indices on rank 0 ...
gpu250: > elapsed time to build and save doc-idx mapping (seconds): 0.004437
gpu250: using:
gpu250: number of documents: 158
gpu250: number of epochs: 2
gpu250: sequence length: 2048
gpu250: total number of samples: 185
gpu250: > elapsed time to build and save sample-idx mapping (seconds): 0.003767
gpu250: > elapsed time to build and save shuffle-idx mapping (seconds): 0.003856
gpu250: > loading doc-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_test_indexmap_120ns_2048sl_1234s_doc_idx.npy
gpu250: > loading sample-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_test_indexmap_120ns_2048sl_1234s_sample_idx.npy
gpu250: > loading shuffle-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_test_indexmap_120ns_2048sl_1234s_shuffle_idx.npy
gpu250: loaded indexed file in 0.008 seconds
gpu250: total number of samples: 186
gpu250: total number of epochs: 2
gpu250: setting training data start iteration to 0
gpu250: setting validation data start iteration to 0
gpu250: done with setups ...
gpu250: time (ms) | model and optimizer: 15361.37 | train/valid/test data iterators: 674.99
gpu250: training ...
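
As a quick sanity check on the log above, the per-stage parameter counts can be added up and compared against the reported total. This is only a minimal sketch: the numbers are copied from the engine.py lines, and the model-parallel size of 2 is an assumption inferred from the "model parallel rank 0" and "model parallel rank 1" lines, not something the log states directly.

```python
# Per-rank STAGE_PARAMS for each pipeline stage, copied from the engine.py log lines.
stage_params = {
    0: 2646199296,
    1: 2718609408,
    2: 2718609408,
    3: 2193110016,
}

# Assumption: two model-parallel ranks per stage, inferred from the
# "number of parameters on model parallel rank 0/1" lines in the log.
model_parallel_size = 2

# TOTAL_PARAMS as reported in the log.
total_reported = 20553056256

# Each stage is held by `model_parallel_size` ranks, each with STAGE_PARAMS parameters.
total_computed = sum(stage_params.values()) * model_parallel_size

print(f"computed: {total_computed:,}")
print(f"reported: {total_reported:,}")
print("match:", total_computed == total_reported)
```

If the two numbers agree, the model was partitioned across the pipeline stages as the log reports, which suggests the hang happens after model setup rather than during it.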

jiezhangGt added the "feature request" label Jul 5, 2023
StellaAthena changed the title from "Training got stuck at the beginning" to "Training gpt stuck at the beginning" Jul 5, 2023
StellaAthena (Member) commented

Is this the same issue as #985?

StellaAthena (Member) commented

@jiezhangGt Hey, wanted to follow up as I believe we have fixed this bug.

StellaAthena (Member) commented

Closing as fixed.
