Training gpt stuck at the beginning #988

jiezhangGt · 2023-07-05T07:39:36Z

I followed the README to train the 20B model from scratch, and the log stops at the point shown below. Training seems to be stuck:

gpu250: Time to load utils op: 0.003278970718383789 seconds
gpu250: [2023-07-05 15:08:29,918] [INFO] [engine.py:83:init] CONFIG: micro_batches=6 micro_batch_size=1
gpu250: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=0 STAGE=0 LAYERS=13 [0, 13) STAGE_PARAMS=2646199296 (2646.199M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu250: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=1 STAGE=0 LAYERS=13 [0, 13) STAGE_PARAMS=2646199296 (2646.199M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu250: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=5 STAGE=1 LAYERS=12 [13, 25) STAGE_PARAMS=2718609408 (2718.609M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu250: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=4 STAGE=1 LAYERS=12 [13, 25) STAGE_PARAMS=2718609408 (2718.609M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu273: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=8 STAGE=2 LAYERS=12 [25, 37) STAGE_PARAMS=2718609408 (2718.609M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu273: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=9 STAGE=2 LAYERS=12 [25, 37) STAGE_PARAMS=2718609408 (2718.609M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu273: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=13 STAGE=3 LAYERS=12 [37, 49) STAGE_PARAMS=2193110016 (2193.110M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu273: [2023-07-05 15:08:30,157] [INFO] [engine.py:138:init] RANK=12 STAGE=3 LAYERS=12 [37, 49) STAGE_PARAMS=2193110016 (2193.110M) TOTAL_PARAMS=20553056256 (20553.056M) UNIQUE_PARAMS=20553056256 (20553.056M)
gpu250: > number of parameters on model parallel rank 0: 2646199296
gpu250: > number of parameters on model parallel rank 1: 2646199296
gpu273: > number of parameters on model parallel rank 0: 2193110016
gpu273: > number of parameters on model parallel rank 1: 2193110016
gpu250: > number of parameters on model parallel rank 0: 2718609408
gpu273: > number of parameters on model parallel rank 0: 2718609408
gpu250: > number of parameters on model parallel rank 1: 2718609408
gpu273: > number of parameters on model parallel rank 1: 2718609408
gpu250: > total params: 20,553,056,256
gpu250: > building train, validation, and test datasets ...
gpu250: reading sizes...
gpu250: reading pointers...
gpu250: reading document index...
gpu250: creating numpy buffer of mmap...
gpu250: creating memory view of numpy buffer...
gpu250: > dataset split:
gpu250: train:
gpu250: document indices in [0, 157218) total of 157218 documents
gpu250: validation:
gpu250: document indices in [157218, 157850) total of 632 documents
gpu250: test:
gpu250: document indices in [157850, 158008) total of 158 documents
gpu250: > WARNING: could not find index map files, building the indices on rank 0 ...
gpu250: > elapsed time to build and save doc-idx mapping (seconds): 0.101815
gpu250: using:
gpu250: number of documents: 157218
gpu250: number of epochs: 11
gpu250: sequence length: 2048
gpu250: total number of samples: 1808420
gpu250: > elapsed time to build and save sample-idx mapping (seconds): 0.053199
gpu250: > elapsed time to build and save shuffle-idx mapping (seconds): 0.078683
gpu250: > loading doc-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_train_indexmap_1800000ns_2048sl_1234s_doc_idx.npy
gpu250: > loading sample-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_train_indexmap_1800000ns_2048sl_1234s_sample_idx.npy
gpu250: > loading shuffle-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_train_indexmap_1800000ns_2048sl_1234s_shuffle_idx.npy
gpu250: loaded indexed file in 0.009 seconds
gpu250: total number of samples: 1808421
gpu250: total number of epochs: 11
gpu250: > WARNING: could not find index map files, building the indices on rank 0 ...
gpu250: > elapsed time to build and save doc-idx mapping (seconds): 0.005961
gpu250: using:
gpu250: number of documents: 632
gpu250: number of epochs: 33
gpu250: sequence length: 2048
gpu250: total number of samples: 18187
gpu250: > elapsed time to build and save sample-idx mapping (seconds): 0.004232
gpu250: > elapsed time to build and save shuffle-idx mapping (seconds): 0.005176
gpu250: > loading doc-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_valid_indexmap_18120ns_2048sl_1234s_doc_idx.npy
gpu250: > loading sample-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_valid_indexmap_18120ns_2048sl_1234s_sample_idx.npy
gpu250: > loading shuffle-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_valid_indexmap_18120ns_2048sl_1234s_shuffle_idx.npy
gpu250: loaded indexed file in 0.013 seconds
gpu250: total number of samples: 18188
gpu250: total number of epochs: 33
gpu250: > WARNING: could not find index map files, building the indices on rank 0 ...
gpu250: > elapsed time to build and save doc-idx mapping (seconds): 0.004437
gpu250: using:
gpu250: number of documents: 158
gpu250: number of epochs: 2
gpu250: sequence length: 2048
gpu250: total number of samples: 185
gpu250: > elapsed time to build and save sample-idx mapping (seconds): 0.003767
gpu250: > elapsed time to build and save shuffle-idx mapping (seconds): 0.003856
gpu250: > loading doc-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_test_indexmap_120ns_2048sl_1234s_doc_idx.npy
gpu250: > loading sample-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_test_indexmap_120ns_2048sl_1234s_sample_idx.npy
gpu250: > loading shuffle-idx mapping from /ssd10/exec/zhangjie07/2024/gpt-neox/data/data_bin_0703/jiaoyu.s3_text_document_test_indexmap_120ns_2048sl_1234s_shuffle_idx.npy
gpu250: loaded indexed file in 0.008 seconds
gpu250: total number of samples: 186
gpu250: total number of epochs: 2
gpu250: setting training data start iteration to 0
gpu250: setting validation data start iteration to 0
gpu250: done with setups ...
gpu250: time (ms) | model and optimizer: 15361.37 | train/valid/test data iterators: 674.99
gpu250: training ...
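
The log above ends at `training ...` with no further output, so it does not show where each process is blocked. One generic way to find out, offered only as a minimal sketch and not as anything gpt-neox or DeepSpeed provides, is to arm Python's standard-library `faulthandler` near the top of the training entry point so that every rank periodically dumps its Python stack. The helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 600-second default are hypothetical, chosen here for illustration.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, stream=None):
    """Dump all Python thread stacks to `stream` every `timeout_s` seconds.

    Hypothetical helper, not part of gpt-neox or DeepSpeed. `stream` must be
    a real file object (it needs a file descriptor); defaults to sys.stderr.
    """
    if stream is None:
        stream = sys.stderr
    # repeat=True keeps dumping at each interval; exit=False leaves the
    # process running so the job is not killed by the watchdog itself.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=True, file=stream, exit=False
    )


def disarm_hang_watchdog():
    """Cancel the periodic dump, e.g. once training is clearly progressing."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` once per process before training starts would make each rank write its stack traces to stderr at every interval. If the frames are the same from one dump to the next, they point at the call where that rank is waiting.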

StellaAthena · 2023-07-06T13:15:24Z

Is this the same issue as #985?

StellaAthena · 2023-07-20T15:14:27Z

@jiezhangGt Hey, wanted to follow up as I believe we have fixed this bug.

StellaAthena · 2023-07-30T20:03:03Z

Closing as fixed.

jiezhangGt added the feature request label Jul 5, 2023

StellaAthena changed the title ~~Training got stuck at the beginning~~ Training gpt stuck at the beginning Jul 5, 2023

StellaAthena closed this as completed Jul 30, 2023
