Skipped 50 iterations in a row due to Overflow - Exiting training. #482
I ran into a similar issue in the past. Are the model weights in the latest checkpoint very large? It might be that the weights are shooting way up and causing overflows. Adding L2 regularization can help prevent this. In small.yml, you can change line 56 to: "weight-decay": 0.01 (see the sketch after the logs below for where this sits in the config).

I'm not sure whether this will override the value stored in the checkpoint, though. One quick test would be to set a negative weight decay and see if it throws an error. If it does, you'll know the weight decay is being read from the updated config and not from the checkpoint.

Edit: Did some testing on my end. I got this error when changing the weight decay:
TracebackNeoXArgs.from_ymls() ['configs/small.yml', 'configs/local_setup.yml'] INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4 -------------------- arguments -------------------- attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated attention_dropout ............... 0.0.........................updated batch_size ...................... 4...........................updated checkpoint_activations .......... True........................updated clip_grad ....................... 1.0.........................updated config_files .................... {'small.yml': '# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n "pipe-parallel-size": 1,\n "model-parallel-size": 1,\n\n # model settings\n "num-layers": 12,\n "hidden-size": 768,\n "num-attention-heads": 12,\n "seq-length": 2048,\n "max-position-embeddings": 2048,\n "norm": "layernorm",\n "pos-emb": "rotary",\n "no-weight-tying": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n "scaled-upper-triang-masked-softmax-fusion": false,\n "bias-gelu-fusion": false,\n\n\n # optimizer settings\n "optimizer": {\n "type": "Adam",\n "params": {\n "lr": 0.0006,\n "betas": [0.9, 0.999],\n "eps": 1.0e-8,\n }\n },\n "zero_optimization": {\n "stage": 0,\n "allgather_partitions": True,\n "allgather_bucket_size": 500000000,\n "overlap_comm": True,\n "reduce_scatter": True,\n "reduce_bucket_size": 500000000,\n "contiguous_gradients": True,\n "cpu_offload": False\n },\n\n # batch / data settings\n "train_micro_batch_size_per_gpu": 4,\n "data-impl": "mmap",\n "split": "949,50,1",\n\n # activation checkpointing\n "checkpoint-activations": true,\n "checkpoint-num-layers": 1,\n "partition-activations": true,\n "synchronize-each-layer": true,\n\n # regularization\n "gradient_clipping": 1.0,\n "weight-decay": 0.5,\n "hidden-dropout": 0.0,\n "attention-dropout": 0.0,\n\n # "no-load-optim": false,\n\n # precision settings\n "fp16": { \n "enabled": true,\n "loss_scale": 0,\n "loss_scale_window": 1000,\n "hysteresis": 2,\n "min_loss_scale": 1\n },\n\n # misc. training settings\n "train-iters": 320000,\n "lr-decay-iters": 320000,\n "distributed-backend": "nccl",\n "lr-decay-style": "cosine",\n "warmup": 0.01,\n "save-interval": 10,\n "eval-interval": 1000,\n "eval-iters": 10,\n\n # logging\n "log-interval": 100,\n "steps_per_print": 10,\n "keep-last-n-checkpoints": 4,\n "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n "data-path": "data/enron/enron_text_document",\n \n # or for weighted datasets: \n # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "train-data-weights": [1., 2.],\n # "test-data-weights": [2., 1.],\n # "valid-data-weights": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. 
\n # WARNING: setting this to True will override any user provided weights\n # "weight_by_num_documents": false,\n # "weighted_sampler_alpha": 0.3,\n\n "vocab-file": "data/gpt2-vocab.json",\n "merge-file": "data/gpt2-merges.txt",\n\n "save": "checkpoints",\n "load": "checkpoints",\n "checkpoint_validation_with_forward_pass": False,\n \n "tensorboard-dir": "tensorboard",\n "log-dir": "logs",\n "use_wandb": False,\n "wandb_host": "https://api.wandb.ai",\n "wandb_project": "neox"\n}\n'}updated data_impl ....................... mmap........................updated data_path ....................... data/enron/enron_text_documentupdated dynamic_loss_scale .............. True........................updated eval_iters ...................... 10..........................updated fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated gas ............................. 1...........................updated global_num_gpus ................. 4...........................updated gradient_clipping ............... 1.0.........................updated hidden_dropout .................. 0.0.........................updated hidden_size ..................... 768.........................updated is_pipe_parallel ................ True........................updated keep_last_n_checkpoints ......... 4...........................updated load ............................ checkpoints.................updated log_dir ......................... logs........................updated log_interval .................... 100.........................updated lr .............................. 0.0006......................updated lr_decay_iters .................. 320000......................updated lr_decay_style .................. cosine......................updated max_position_embeddings ......... 2048........................updated merge_file ...................... data/gpt2-merges.txt........updated no_weight_tying ................. True........................updated num_attention_heads ............. 12..........................updated num_layers ...................... 12..........................updated optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated optimizer_type .................. Adam........................updated partition_activations ........... True........................updated pipe_parallel_size .............. 1...........................updated pos_emb ......................... rotary......................updated precision ....................... fp16........................updated save ............................ checkpoints.................updated save_interval ................... 10..........................updated seq_length ...................... 2048........................updated sparsity_config ................. {}..........................updated split ........................... 949,50,1....................updated synchronize_each_layer .......... True........................updated tensorboard_dir ................. tensorboard.................updated train_batch_size ................ 16..........................updated train_iters ..................... 320000......................updated train_micro_batch_size_per_gpu .. 4...........................updated use_wandb ....................... False.......................updated user_script ..................... train.py....................updated vocab_file ...................... 
data/gpt2-vocab.json........updated wall_clock_breakdown ............ True........................updated wandb_group ..................... PGUomanNCJsAJwjxHxxMQH_23td7yzpupdated weight_decay .................... 0.5.........................updated zero_allgather_bucket_size ...... 500000000...................updated zero_contiguous_gradients ....... True........................updated zero_optimization ............... {'stage': 0, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False}updated zero_reduce_bucket_size ......... 500000000...................updated zero_reduce_scatter ............. True........................updated zero_stage ...................... 0...........................updated activation ...................... gelu........................default adlr_autoresume ................. False.......................default adlr_autoresume_interval ........ 1000........................default amp ............................. None........................default apply_query_key_layer_scaling ... False.......................default attention_softmax_in_fp32 ....... False.......................default bias_dropout_fusion ............. False.......................default bias_gelu_fusion ................ False.......................default char_level_ppl .................. False.......................default checkpoint_in_cpu ............... False.......................default checkpoint_num_layers ........... 1...........................default checkpoint_validation_with_forward_pass False................default contiguous_checkpointing ........ False.......................default deepscale ....................... False.......................default deepscale_config ................ None........................default deepspeed ....................... True........................default deepspeed_activation_checkpointing True......................default deepspeed_mpi ................... False.......................default detect_nvlink_pairs ............. False.......................default distributed_backend ............. nccl........................default do_test ......................... None........................default do_train ........................ None........................default do_valid ........................ None........................default dump_state ...................... False.......................default eod_mask_loss ................... False.......................default eval_interval ................... 1000........................default eval_results_prefix ............. ............................default eval_tasks ...................... None........................default exclude ......................... None........................default exit_interval ................... None........................default finetune ........................ False.......................default flops_profiler .................. None........................default fp16_lm_cross_entropy ........... False.......................default fp32_allreduce .................. False.......................default git_hash ........................ 49e60fe.....................default gmlp_attn_dim ................... 64..........................default gpt_j_residual .................. False.......................default gradient_accumulation_steps ..... 
1...........................default gradient_noise_scale_cpu_offload False.......................default gradient_noise_scale_n_batches .. 5...........................default gradient_predivide_factor ....... 1.0.........................default hostfile ........................ None........................default hysteresis ...................... 2...........................default include ......................... None........................default init_method ..................... normal......................default init_method_std ................. 0.02........................default iteration ....................... None........................default launcher ........................ pdsh........................default layernorm_epsilon ............... 1e-05.......................default lazy_mpu_init ................... False.......................default local_rank ...................... None........................default log_grad_norm ................... False.......................default log_gradient_noise_scale ........ False.......................default log_optimizer_states ............ False.......................default log_param_norm .................. False.......................default loss_scale ...................... None........................default loss_scale_window ............... 1000.0......................default make_vocab_size_divisible_by .... 128.........................default master_addr ..................... None........................default master_port ..................... 29500.......................default maximum_tokens .................. 64..........................default min_lr .......................... 0.0.........................default min_scale ....................... 1.0.........................default mmap_warmup ..................... False.......................default model_parallel_size ............. 1...........................default no_load_optim ................... False.......................default no_load_rng ..................... False.......................default no_save_optim ................... False.......................default no_save_rng ..................... False.......................default norm ............................ layernorm...................default num_gpus ........................ None........................default num_nodes ....................... -1..........................default num_samples ..................... 0...........................default num_unique_layers ............... None........................default num_workers ..................... 2...........................default onnx_safe ....................... False.......................default output_layer_init_method ........ scaled_normal...............default output_layer_parallelism ........ row.........................default override_lr_scheduler ........... False.......................default padded_vocab_size ............... None........................default param_sharing_style ............. grouped.....................default pipe_partition_method ........... type:transformer|mlp........default prescale_gradients .............. False.......................default profile_backward ................ False.......................default rank ............................ None........................default recompute ....................... False.......................default rms_norm_epsilon ................ 1e-08.......................default rotary_emb_base ................. 
10000.......................default rotary_pct ...................... 1.0.........................default rpe_max_distance ................ 128.........................default rpe_num_buckets ................. 32..........................default sample_input_file ............... None........................default sample_output_file .............. None........................default scaled_masked_softmax_fusion .... False.......................default scaled_upper_triang_masked_softmax_fusion False..............default scalenorm_epsilon ............... 1e-08.......................default scheduler ....................... None........................default seed ............................ 1234........................default short_seq_prob .................. 0.1.........................default soft_prompt_tuning .............. None........................default sparse_gradients ................ False.......................default steps_per_print ................. 10..........................default temperature ..................... 0.0.........................default test_data_paths ................. None........................default test_data_weights ............... None........................default text_gen_type ................... None........................default tokenizer_type .................. GPT2BPETokenizer............default top_k ........................... 0...........................default top_p ........................... 0.0.........................default train_data_paths ................ None........................default train_data_weights .............. None........................default use_bnb_optimizer ............... False.......................default use_checkpoint_lr_scheduler ..... False.......................default use_cpu_initialization .......... False.......................default valid_data_paths ................ None........................default valid_data_weights .............. None........................default wandb_host ...................... https://api.wandb.ai........default wandb_project ................... neox........................default wandb_team ...................... None........................default warmup .......................... 0.01........................default weight_by_num_documents ......... False.......................default weighted_sampler_alpha .......... 0.3.........................default world_size ...................... None........................default zero_allow_untested_optimizer ... False.......................default ---------------- end of arguments ---------------- [2021-12-16 15:58:05,404] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. 
[2021-12-16 15:58:05,405] [INFO] [runner.py:366:main] cmd = /mnt/workspace/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 0, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small.yml": "# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 1,\n \"model-parallel-size\": 1,\n\n # model settings\n \"num-layers\": 12,\n \"hidden-size\": 768,\n \"num-attention-heads\": 12,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"no-weight-tying\": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n \"scaled-upper-triang-masked-softmax-fusion\": false,\n \"bias-gelu-fusion\": false,\n\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.0006,\n \"betas\": [0.9, 0.999],\n \"eps\": 1.0e-8,\n }\n },\n \"zero_optimization\": {\n \"stage\": 0,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 500000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 500000000,\n \"contiguous_gradients\": True,\n \"cpu_offload\": False\n },\n\n # batch / data settings\n \"train_micro_batch_size_per_gpu\": 4,\n \"data-impl\": \"mmap\",\n \"split\": \"949,50,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": true,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 
0.5,\n \"hidden-dropout\": 0.0,\n \"attention-dropout\": 0.0,\n\n # \"no-load-optim\": false,\n\n # precision settings\n \"fp16\": { \n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. training settings\n \"train-iters\": 320000,\n \"lr-decay-iters\": 320000,\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"save-interval\": 10,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 100,\n \"steps_per_print\": 10,\n \"keep-last-n-checkpoints\": 4,\n \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n \"data-path\": \"data/enron/enron_text_document\",\n \n # or for weighted datasets: \n # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"train-data-weights\": [1., 2.],\n # \"test-data-weights\": [2., 1.],\n # \"valid-data-weights\": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n # WARNING: setting this to True will override any user provided weights\n # \"weight_by_num_documents\": false,\n # \"weighted_sampler_alpha\": 0.3,\n\n \"vocab-file\": \"data/gpt2-vocab.json\",\n \"merge-file\": \"data/gpt2-merges.txt\",\n\n \"save\": \"checkpoints\",\n \"load\": \"checkpoints\",\n \"checkpoint_validation_with_forward_pass\": False,\n \n \"tensorboard-dir\": \"tensorboard\",\n \"log-dir\": \"logs\",\n \"use_wandb\": False,\n \"wandb_host\": \"https://api.wandb.ai\",\n \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "save_interval": 10, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.5, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": false, "wandb_group": "PGUomanNCJsAJwjxHxxMQH_23td7yzp", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4} [2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE libnccl-dev=2.8.4-1+cuda11.1 [2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NCCL_VERSION 2.8.4-1 [2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_VERSION 2.8.4-1 [2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE libnccl2=2.8.4-1+cuda11.1 [2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME libnccl-dev [2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_NAME libnccl2 [2021-12-16 15:58:06,363] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION 2.8.4-1 [2021-12-16 15:58:06,363] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} [2021-12-16 15:58:06,363] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0 [2021-12-16 15:58:06,363] [INFO] [launch.py:103:main] global_rank_mapping=defaultdict(, 
{'localhost': [0, 1, 2, 3]}) [2021-12-16 15:58:06,363] [INFO] [launch.py:104:main] dist_world_size=4 [2021-12-16 15:58:06,363] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3 NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1 > building GPT2BPETokenizer tokenizer ... > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304) [2021-12-16 15:58:09,760] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl > setting tensorboard ... > initializing torch distributed ... [2021-12-16 15:58:09,776] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-12-16 15:58:09,779] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-12-16 15:58:09,823] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl > initializing model parallel with size 1 MPU DP: [0, 1, 2, 3] MPU PP: [0] MPU PP: [1] MPU PP: [2] MPU PP: [3] MPU MP: [0] MPU MP: [1] MPU MP: [2] MPU MP: [3] > setting random seeds to 1234 ... [2021-12-16 15:58:10,838] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 make: Entering directory '/mnt/workspace/gpt-neox/megatron/data' make: Nothing to be done for 'default'. make: Leaving directory '/mnt/workspace/gpt-neox/megatron/data' building GPT2 model ... SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3} [2021-12-16 15:58:15,327] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp stage=0 layers=17 0: EmbeddingPipe 1: _pre_transformer_block 2: ParallelTransformerLayerPipe 3: ParallelTransformerLayerPipe 4: ParallelTransformerLayerPipe 5: ParallelTransformerLayerPipe 6: ParallelTransformerLayerPipe 7: ParallelTransformerLayerPipe 8: ParallelTransformerLayerPipe 9: ParallelTransformerLayerPipe 10: ParallelTransformerLayerPipe 11: ParallelTransformerLayerPipe 12: ParallelTransformerLayerPipe 13: ParallelTransformerLayerPipe 14: _post_transformer_block 15: NormPipe 16: ParallelLinearPipe loss: partial [2021-12-16 15:58:15,356] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08} > learning rate decay style: cosine [2021-12-16 15:58:15,358] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. DeepSpeed is enabled. [2021-12-16 15:58:15,359] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main [2021-12-16 15:58:15,359] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. [2021-12-16 15:58:15,361] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. 
[2021-12-16 15:58:16,710] [INFO] [engine.py:654:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer [2021-12-16 15:58:16,710] [INFO] [engine.py:659:_configure_optimizer] Using client Optimizer as basic optimizer [2021-12-16 15:58:16,710] [INFO] [engine.py:668:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam [2021-12-16 15:58:16,710] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale [2021-12-16 15:58:16,751] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam [2021-12-16 15:58:16,751] [INFO] [engine.py:498:_configure_lr_scheduler] DeepSpeed using client LR scheduler [2021-12-16 15:58:16,751] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2021-12-16 15:58:16,751] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[[0.9, 0.999], [0.9, 0.999]] [2021-12-16 15:58:16,751] [INFO] [config.py:759:print] DeepSpeedEngine configuration: [2021-12-16 15:58:16,752] [INFO] [config.py:763:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2021-12-16 15:58:16,752] [INFO] [config.py:763:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2021-12-16 15:58:16,752] [INFO] [config.py:763:print] allreduce_always_fp32 ........ False [2021-12-16 15:58:16,752] [INFO] [config.py:763:print] amp_enabled .................. False [2021-12-16 15:58:16,752] [INFO] [config.py:763:print] amp_params ................... False Using /root/.cache/torch_extensions as PyTorch extensions root...Using /root/.cache/torch_extensions as PyTorch extensions root...[2021-12-16 15:58:16,752] [INFO] [config.py:763:print] checkpoint_tag_validation_enabled True So I tried adding this to the config: "no-load-optim": true That resulted in:
TracebackNeoXArgs.from_ymls() ['configs/small.yml', 'configs/local_setup.yml'] INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4 -------------------- arguments -------------------- attention_config ................ ['global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global', 'global']updated attention_dropout ............... 0.0.........................updated batch_size ...................... 4...........................updated checkpoint_activations .......... True........................updated clip_grad ....................... 1.0.........................updated config_files .................... {'small.yml': '# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n "pipe-parallel-size": 1,\n "model-parallel-size": 1,\n\n # model settings\n "num-layers": 12,\n "hidden-size": 768,\n "num-attention-heads": 12,\n "seq-length": 2048,\n "max-position-embeddings": 2048,\n "norm": "layernorm",\n "pos-emb": "rotary",\n "no-weight-tying": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n "scaled-upper-triang-masked-softmax-fusion": false,\n "bias-gelu-fusion": false,\n\n\n # optimizer settings\n "optimizer": {\n "type": "Adam",\n "params": {\n "lr": 0.0006,\n "betas": [0.9, 0.999],\n "eps": 1.0e-8,\n }\n },\n "zero_optimization": {\n "stage": 0,\n "allgather_partitions": True,\n "allgather_bucket_size": 500000000,\n "overlap_comm": True,\n "reduce_scatter": True,\n "reduce_bucket_size": 500000000,\n "contiguous_gradients": True,\n "cpu_offload": False\n },\n\n # batch / data settings\n "train_micro_batch_size_per_gpu": 4,\n "data-impl": "mmap",\n "split": "949,50,1",\n\n # activation checkpointing\n "checkpoint-activations": true,\n "checkpoint-num-layers": 1,\n "partition-activations": true,\n "synchronize-each-layer": true,\n\n # regularization\n "gradient_clipping": 1.0,\n "weight-decay": 0.5,\n "hidden-dropout": 0.0,\n "attention-dropout": 0.0,\n\n "no-load-optim": true,\n\n # precision settings\n "fp16": { \n "enabled": true,\n "loss_scale": 0,\n "loss_scale_window": 1000,\n "hysteresis": 2,\n "min_loss_scale": 1\n },\n\n # misc. training settings\n "train-iters": 320000,\n "lr-decay-iters": 320000,\n "distributed-backend": "nccl",\n "lr-decay-style": "cosine",\n "warmup": 0.01,\n "save-interval": 10,\n "eval-interval": 1000,\n "eval-iters": 10,\n\n # logging\n "log-interval": 100,\n "steps_per_print": 10,\n "keep-last-n-checkpoints": 4,\n "wall_clock_breakdown": true,\n}\n', 'local_setup.yml': '# Suggested data paths when using GPT-NeoX locally\n{\n "data-path": "data/enron/enron_text_document",\n \n # or for weighted datasets: \n # "train-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "test-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "valid-data-paths": ["data/enron/enron_text_document", "data/enron/enron_text_document"],\n # "train-data-weights": [1., 2.],\n # "test-data-weights": [2., 1.],\n # "valid-data-weights": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. 
\n # WARNING: setting this to True will override any user provided weights\n # "weight_by_num_documents": false,\n # "weighted_sampler_alpha": 0.3,\n\n "vocab-file": "data/gpt2-vocab.json",\n "merge-file": "data/gpt2-merges.txt",\n\n "save": "checkpoints",\n "load": "checkpoints",\n "checkpoint_validation_with_forward_pass": False,\n \n "tensorboard-dir": "tensorboard",\n "log-dir": "logs",\n "use_wandb": False,\n "wandb_host": "https://api.wandb.ai",\n "wandb_project": "neox"\n}\n'}updated data_impl ....................... mmap........................updated data_path ....................... data/enron/enron_text_documentupdated dynamic_loss_scale .............. True........................updated eval_iters ...................... 10..........................updated fp16 ............................ {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated gas ............................. 1...........................updated global_num_gpus ................. 4...........................updated gradient_clipping ............... 1.0.........................updated hidden_dropout .................. 0.0.........................updated hidden_size ..................... 768.........................updated is_pipe_parallel ................ True........................updated keep_last_n_checkpoints ......... 4...........................updated load ............................ checkpoints.................updated log_dir ......................... logs........................updated log_interval .................... 100.........................updated lr .............................. 0.0006......................updated lr_decay_iters .................. 320000......................updated lr_decay_style .................. cosine......................updated max_position_embeddings ......... 2048........................updated merge_file ...................... data/gpt2-merges.txt........updated no_load_optim ................... True........................updated no_weight_tying ................. True........................updated num_attention_heads ............. 12..........................updated num_layers ...................... 12..........................updated optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated optimizer_type .................. Adam........................updated partition_activations ........... True........................updated pipe_parallel_size .............. 1...........................updated pos_emb ......................... rotary......................updated precision ....................... fp16........................updated save ............................ checkpoints.................updated save_interval ................... 10..........................updated seq_length ...................... 2048........................updated sparsity_config ................. {}..........................updated split ........................... 949,50,1....................updated synchronize_each_layer .......... True........................updated tensorboard_dir ................. tensorboard.................updated train_batch_size ................ 16..........................updated train_iters ..................... 320000......................updated train_micro_batch_size_per_gpu .. 4...........................updated use_wandb ....................... False.......................updated user_script ..................... 
train.py....................updated vocab_file ...................... data/gpt2-vocab.json........updated wall_clock_breakdown ............ True........................updated wandb_group ..................... Mf6keriq8236v5DsKyvGvG_fxq3a3ahupdated weight_decay .................... 0.5.........................updated zero_allgather_bucket_size ...... 500000000...................updated zero_contiguous_gradients ....... True........................updated zero_optimization ............... {'stage': 0, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False}updated zero_reduce_bucket_size ......... 500000000...................updated zero_reduce_scatter ............. True........................updated zero_stage ...................... 0...........................updated activation ...................... gelu........................default adlr_autoresume ................. False.......................default adlr_autoresume_interval ........ 1000........................default amp ............................. None........................default apply_query_key_layer_scaling ... False.......................default attention_softmax_in_fp32 ....... False.......................default bias_dropout_fusion ............. False.......................default bias_gelu_fusion ................ False.......................default char_level_ppl .................. False.......................default checkpoint_in_cpu ............... False.......................default checkpoint_num_layers ........... 1...........................default checkpoint_validation_with_forward_pass False................default contiguous_checkpointing ........ False.......................default deepscale ....................... False.......................default deepscale_config ................ None........................default deepspeed ....................... True........................default deepspeed_activation_checkpointing True......................default deepspeed_mpi ................... False.......................default detect_nvlink_pairs ............. False.......................default distributed_backend ............. nccl........................default do_test ......................... None........................default do_train ........................ None........................default do_valid ........................ None........................default dump_state ...................... False.......................default eod_mask_loss ................... False.......................default eval_interval ................... 1000........................default eval_results_prefix ............. ............................default eval_tasks ...................... None........................default exclude ......................... None........................default exit_interval ................... None........................default finetune ........................ False.......................default flops_profiler .................. None........................default fp16_lm_cross_entropy ........... False.......................default fp32_allreduce .................. False.......................default git_hash ........................ 49e60fe.....................default gmlp_attn_dim ................... 64..........................default gpt_j_residual .................. False.......................default gradient_accumulation_steps ..... 
1...........................default gradient_noise_scale_cpu_offload False.......................default gradient_noise_scale_n_batches .. 5...........................default gradient_predivide_factor ....... 1.0.........................default hostfile ........................ None........................default hysteresis ...................... 2...........................default include ......................... None........................default init_method ..................... normal......................default init_method_std ................. 0.02........................default iteration ....................... None........................default launcher ........................ pdsh........................default layernorm_epsilon ............... 1e-05.......................default lazy_mpu_init ................... False.......................default local_rank ...................... None........................default log_grad_norm ................... False.......................default log_gradient_noise_scale ........ False.......................default log_optimizer_states ............ False.......................default log_param_norm .................. False.......................default loss_scale ...................... None........................default loss_scale_window ............... 1000.0......................default make_vocab_size_divisible_by .... 128.........................default master_addr ..................... None........................default master_port ..................... 29500.......................default maximum_tokens .................. 64..........................default min_lr .......................... 0.0.........................default min_scale ....................... 1.0.........................default mmap_warmup ..................... False.......................default model_parallel_size ............. 1...........................default no_load_rng ..................... False.......................default no_save_optim ................... False.......................default no_save_rng ..................... False.......................default norm ............................ layernorm...................default num_gpus ........................ None........................default num_nodes ....................... -1..........................default num_samples ..................... 0...........................default num_unique_layers ............... None........................default num_workers ..................... 2...........................default onnx_safe ....................... False.......................default output_layer_init_method ........ scaled_normal...............default output_layer_parallelism ........ row.........................default override_lr_scheduler ........... False.......................default padded_vocab_size ............... None........................default param_sharing_style ............. grouped.....................default pipe_partition_method ........... type:transformer|mlp........default prescale_gradients .............. False.......................default profile_backward ................ False.......................default rank ............................ None........................default recompute ....................... False.......................default rms_norm_epsilon ................ 1e-08.......................default rotary_emb_base ................. 10000.......................default rotary_pct ...................... 
1.0.........................default rpe_max_distance ................ 128.........................default rpe_num_buckets ................. 32..........................default sample_input_file ............... None........................default sample_output_file .............. None........................default scaled_masked_softmax_fusion .... False.......................default scaled_upper_triang_masked_softmax_fusion False..............default scalenorm_epsilon ............... 1e-08.......................default scheduler ....................... None........................default seed ............................ 1234........................default short_seq_prob .................. 0.1.........................default soft_prompt_tuning .............. None........................default sparse_gradients ................ False.......................default steps_per_print ................. 10..........................default temperature ..................... 0.0.........................default test_data_paths ................. None........................default test_data_weights ............... None........................default text_gen_type ................... None........................default tokenizer_type .................. GPT2BPETokenizer............default top_k ........................... 0...........................default top_p ........................... 0.0.........................default train_data_paths ................ None........................default train_data_weights .............. None........................default use_bnb_optimizer ............... False.......................default use_checkpoint_lr_scheduler ..... False.......................default use_cpu_initialization .......... False.......................default valid_data_paths ................ None........................default valid_data_weights .............. None........................default wandb_host ...................... https://api.wandb.ai........default wandb_project ................... neox........................default wandb_team ...................... None........................default warmup .......................... 0.01........................default weight_by_num_documents ......... False.......................default weighted_sampler_alpha .......... 0.3.........................default world_size ...................... None........................default zero_allow_untested_optimizer ... False.......................default ---------------- end of arguments ---------------- [2021-12-16 16:00:00,007] [WARNING] [runner.py:126:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. 
[2021-12-16 16:00:00,007] [INFO] [runner.py:366:main] cmd = /mnt/workspace/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 train.py --deepspeed_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true} --megatron_config {"train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": {"type": "Adam", "params": {"lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08}}, "fp16": {"enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1}, "gradient_clipping": 1.0, "zero_optimization": {"stage": 0, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true, "cpu_offload": false}, "wall_clock_breakdown": true, "precision": "fp16", "num_layers": 12, "hidden_size": 768, "num_attention_heads": 12, "seq_length": 2048, "max_position_embeddings": 2048, "pos_emb": "rotary", "no_weight_tying": true, "attention_config": ["global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global", "global"], "sparsity_config": {}, "lr_decay_style": "cosine", "lr_decay_iters": 320000, "optimizer_type": "Adam", "zero_stage": 0, "zero_reduce_scatter": true, "zero_contiguous_gradients": true, "zero_reduce_bucket_size": 500000000, "zero_allgather_bucket_size": 500000000, "lr": 0.0006, "data_path": "data/enron/enron_text_document", "data_impl": "mmap", "save": "checkpoints", "config_files": {"small.yml": "# GPT-2 pretraining setup\n{\n # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages\n # across the node boundaries )\n \"pipe-parallel-size\": 1,\n \"model-parallel-size\": 1,\n\n # model settings\n \"num-layers\": 12,\n \"hidden-size\": 768,\n \"num-attention-heads\": 12,\n \"seq-length\": 2048,\n \"max-position-embeddings\": 2048,\n \"norm\": \"layernorm\",\n \"pos-emb\": \"rotary\",\n \"no-weight-tying\": true,\n\n # these should provide some speedup but takes a while to build, set to true if desired\n \"scaled-upper-triang-masked-softmax-fusion\": false,\n \"bias-gelu-fusion\": false,\n\n\n # optimizer settings\n \"optimizer\": {\n \"type\": \"Adam\",\n \"params\": {\n \"lr\": 0.0006,\n \"betas\": [0.9, 0.999],\n \"eps\": 1.0e-8,\n }\n },\n \"zero_optimization\": {\n \"stage\": 0,\n \"allgather_partitions\": True,\n \"allgather_bucket_size\": 500000000,\n \"overlap_comm\": True,\n \"reduce_scatter\": True,\n \"reduce_bucket_size\": 500000000,\n \"contiguous_gradients\": True,\n \"cpu_offload\": False\n },\n\n # batch / data settings\n \"train_micro_batch_size_per_gpu\": 4,\n \"data-impl\": \"mmap\",\n \"split\": \"949,50,1\",\n\n # activation checkpointing\n \"checkpoint-activations\": true,\n \"checkpoint-num-layers\": 1,\n \"partition-activations\": true,\n \"synchronize-each-layer\": true,\n\n # regularization\n \"gradient_clipping\": 1.0,\n \"weight-decay\": 
0.5,\n \"hidden-dropout\": 0.0,\n \"attention-dropout\": 0.0,\n\n \"no-load-optim\": true,\n\n # precision settings\n \"fp16\": { \n \"enabled\": true,\n \"loss_scale\": 0,\n \"loss_scale_window\": 1000,\n \"hysteresis\": 2,\n \"min_loss_scale\": 1\n },\n\n # misc. training settings\n \"train-iters\": 320000,\n \"lr-decay-iters\": 320000,\n \"distributed-backend\": \"nccl\",\n \"lr-decay-style\": \"cosine\",\n \"warmup\": 0.01,\n \"save-interval\": 10,\n \"eval-interval\": 1000,\n \"eval-iters\": 10,\n\n # logging\n \"log-interval\": 100,\n \"steps_per_print\": 10,\n \"keep-last-n-checkpoints\": 4,\n \"wall_clock_breakdown\": true,\n}\n", "local_setup.yml": "# Suggested data paths when using GPT-NeoX locally\n{\n \"data-path\": \"data/enron/enron_text_document\",\n \n # or for weighted datasets: \n # \"train-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"test-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"valid-data-paths\": [\"data/enron/enron_text_document\", \"data/enron/enron_text_document\"],\n # \"train-data-weights\": [1., 2.],\n # \"test-data-weights\": [2., 1.],\n # \"valid-data-weights\": [0.5, 0.4],\n\n # If weight_by_num_documents is True, Builds dataset weights from a multinomial distribution over groups of data according to the number of documents in each group. \n # WARNING: setting this to True will override any user provided weights\n # \"weight_by_num_documents\": false,\n # \"weighted_sampler_alpha\": 0.3,\n\n \"vocab-file\": \"data/gpt2-vocab.json\",\n \"merge-file\": \"data/gpt2-merges.txt\",\n\n \"save\": \"checkpoints\",\n \"load\": \"checkpoints\",\n \"checkpoint_validation_with_forward_pass\": False,\n \n \"tensorboard-dir\": \"tensorboard\",\n \"log-dir\": \"logs\",\n \"use_wandb\": False,\n \"wandb_host\": \"https://api.wandb.ai\",\n \"wandb_project\": \"neox\"\n}\n"}, "load": "checkpoints", "save_interval": 10, "no_load_optim": true, "batch_size": 4, "train_iters": 320000, "eval_iters": 10, "keep_last_n_checkpoints": 4, "split": "949,50,1", "vocab_file": "data/gpt2-vocab.json", "merge_file": "data/gpt2-merges.txt", "attention_dropout": 0.0, "hidden_dropout": 0.0, "weight_decay": 0.5, "checkpoint_activations": true, "synchronize_each_layer": true, "partition_activations": true, "gas": 1, "clip_grad": 1.0, "dynamic_loss_scale": true, "pipe_parallel_size": 1, "is_pipe_parallel": true, "use_wandb": false, "wandb_group": "Mf6keriq8236v5DsKyvGvG_fxq3a3ah", "log_dir": "logs", "tensorboard_dir": "tensorboard", "log_interval": 100, "user_script": "train.py", "global_num_gpus": 4} [2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE libnccl-dev=2.8.4-1+cuda11.1 [2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NCCL_VERSION 2.8.4-1 [2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_VERSION 2.8.4-1 [2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE libnccl2=2.8.4-1+cuda11.1 [2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME libnccl-dev [2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_PACKAGE_NAME libnccl2 [2021-12-16 16:00:00,941] [INFO] [launch.py:75:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION 2.8.4-1 [2021-12-16 16:00:00,942] [INFO] [launch.py:82:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} [2021-12-16 16:00:00,942] [INFO] [launch.py:88:main] nnodes=1, num_local_procs=4, node_rank=0 [2021-12-16 16:00:00,942] [INFO] [launch.py:103:main] 
global_rank_mapping=defaultdict(, {'localhost': [0, 1, 2, 3]}) [2021-12-16 16:00:00,942] [INFO] [launch.py:104:main] dist_world_size=4 [2021-12-16 16:00:00,942] [INFO] [launch.py:112:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3 [2021-12-16 16:00:04,255] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl NeoXArgs.configure_distributed_args() using world size: 4 and model-parallel size: 1 > building GPT2BPETokenizer tokenizer ... > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304) [2021-12-16 16:00:04,374] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl > setting tensorboard ... > initializing torch distributed ... [2021-12-16 16:00:04,410] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl [2021-12-16 16:00:04,417] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl > initializing model parallel with size 1 MPU DP: [0, 1, 2, 3] MPU PP: [0] MPU PP: [1] MPU PP: [2] MPU PP: [3] MPU MP: [0] MPU MP: [1] MPU MP: [2] MPU MP: [3] > setting random seeds to 1234 ... [2021-12-16 16:00:05,453] [INFO] [checkpointing.py:223:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 make: Entering directory '/mnt/workspace/gpt-neox/megatron/data' make: Nothing to be done for 'default'. make: Leaving directory '/mnt/workspace/gpt-neox/megatron/data' building GPT2 model ... SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3} [2021-12-16 16:00:09,950] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp stage=0 layers=17 0: EmbeddingPipe 1: _pre_transformer_block 2: ParallelTransformerLayerPipe 3: ParallelTransformerLayerPipe 4: ParallelTransformerLayerPipe 5: ParallelTransformerLayerPipe 6: ParallelTransformerLayerPipe 7: ParallelTransformerLayerPipe 8: ParallelTransformerLayerPipe 9: ParallelTransformerLayerPipe 10: ParallelTransformerLayerPipe 11: ParallelTransformerLayerPipe 12: ParallelTransformerLayerPipe 13: ParallelTransformerLayerPipe 14: _post_transformer_block 15: NormPipe 16: ParallelLinearPipe loss: partial [2021-12-16 16:00:09,977] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. [2021-12-16 16:00:09,978] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. DeepSpeed is enabled. [2021-12-16 16:00:09,979] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main [2021-12-16 16:00:09,979] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. [2021-12-16 16:00:09,980] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer. 
[2021-12-16 16:00:11,332] [INFO] [config.py:759:print] DeepSpeedEngine configuration: [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] allreduce_always_fp32 ........ False [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] amp_enabled .................. False [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] amp_params ................... False [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] checkpoint_tag_validation_enabled True [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] checkpoint_tag_validation_fail False [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] disable_allgather ............ False [2021-12-16 16:00:11,333] [INFO] [config.py:763:print] dump_state ................... False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1} [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] elasticity_enabled ........... False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 3, "detailed": true } [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] fp16_enabled ................. True [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] fp16_type .................... fp16 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] global_rank .................. 0 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] gradient_accumulation_steps .. 1 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] gradient_clipping ............ 1.0 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] gradient_predivide_factor .... 1.0 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] initial_dynamic_scale ........ 4294967296 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] loss_scale ................... 0 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] memory_breakdown ............. False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] optimizer_legacy_fusion ...... False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] optimizer_name ............... adam [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] optimizer_params ............. {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08} [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] pld_enabled .................. False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] pld_params ................... False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] precision .................... torch.float16 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] prescale_gradients ........... False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] scheduler_name ............... None [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] scheduler_params ............. 
None [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] sparse_attention ............. None [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] sparse_gradients_enabled ..... False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] steps_per_print .............. 10 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] tensorboard_enabled .......... False [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] tensorboard_job_name ......... DeepSpeedJobName [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] tensorboard_output_path ...... [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] train_batch_size ............. 16 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] train_micro_batch_size_per_gpu 4 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] wall_clock_breakdown ......... True [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] world_size ................... 4 [2021-12-16 16:00:11,334] [INFO] [config.py:763:print] zero_allow_untested_optimizer False [2021-12-16 16:00:11,335] [INFO] [config.py:763:print] zero_config .................. { "stage": 0, "contiguous_gradients": true, "reduce_scatter": true, "reduce_bucket_size": 5.000000e+08, "allgather_partitions": true, "allgather_bucket_size": 5.000000e+08, "overlap_comm": true, "load_from_fp32_weights": true, "elastic_checkpoint": true, "offload_param": null, "offload_optimizer": null, "sub_group_size": 1.000000e+12, "prefetch_bucket_size": 5.000000e+07, "param_persistence_threshold": 1.000000e+05, "max_live_parameters": 1.000000e+09, "max_reuse_distance": 1.000000e+09, "gather_fp16_weights_on_model_save": false } [2021-12-16 16:00:11,335] [INFO] [config.py:763:print] zero_enabled ................. False [2021-12-16 16:00:11,335] [INFO] [config.py:763:print] zero_optimization_stage ...... 0 Using /root/.cache/torch_extensions as PyTorch extensions root... Using /root/.cache/torch_extensions as PyTorch extensions root... [2021-12-16 16:00:11,335] [INFO] [config.py:765:print] json = { "train_batch_size": 16, "train_micro_batch_size_per_gpu": 4, "optimizer": { "type": "Adam", "params": { "lr": 0.0006, "betas": [0.9, 0.999], "eps": 1e-08 } }, "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "gradient_clipping": 1.0, "zero_optimization": { "stage": 0, "allgather_partitions": true, "allgather_bucket_size": 5.000000e+08, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 5.000000e+08, "contiguous_gradients": true, "cpu_offload": false }, "wall_clock_breakdown": true } Using /root/.cache/torch_extensions as PyTorch extensions root... Using /root/.cache/torch_extensions as PyTorch extensions root... Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja... Building extension module utils... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module utils... Time to load utils op: 0.3736841678619385 seconds Loading extension module utils... Loading extension module utils... Time to load utils op: 0.4020102024078369 seconds Loading extension module utils... 
Time to load utils op: 0.40239429473876953 seconds Time to load utils op: 0.4019603729248047 seconds [2021-12-16 16:00:11,739] [INFO] [engine.py:84:__init__] CONFIG: micro_batches=1 micro_batch_size=4 [2021-12-16 16:00:11,895] [INFO] [engine.py:141:__init__] RANK=0 STAGE=0 LAYERS=17 [0, 17) STAGE_PARAMS=162322944 (162.323M) TOTAL_PARAMS=162322944 (162.323M) UNIQUE_PARAMS=162322944 (162.323M) > number of parameters on model parallel rank 0: 162322944 > total params: 162,322,944 [2021-12-16 16:00:13,283] [INFO] [engine.py:1551:_load_checkpoint] rank: 3 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt [2021-12-16 16:00:13,283] [INFO] [engine.py:1551:_load_checkpoint] rank: 2 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt [2021-12-16 16:00:13,283] [INFO] [engine.py:1551:_load_checkpoint] rank: 0 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt [2021-12-16 16:00:13,283] [INFO] [engine.py:1551:_load_checkpoint] rank: 1 loading checkpoint: checkpoints/global_step30/mp_rank_00_model_states.pt [2021-12-16 16:00:14,495] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=0 file=checkpoints/global_step30/layer_00-model_00-model_states.pt [2021-12-16 16:00:14,505] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=2 file=checkpoints/global_step30/layer_02-model_00-model_states.pt [2021-12-16 16:00:14,513] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=3 file=checkpoints/global_step30/layer_03-model_00-model_states.pt [2021-12-16 16:00:14,521] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=4 file=checkpoints/global_step30/layer_04-model_00-model_states.pt [2021-12-16 16:00:14,529] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=5 file=checkpoints/global_step30/layer_05-model_00-model_states.pt [2021-12-16 16:00:14,537] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=6 file=checkpoints/global_step30/layer_06-model_00-model_states.pt [2021-12-16 16:00:14,546] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=7 file=checkpoints/global_step30/layer_07-model_00-model_states.pt [2021-12-16 16:00:14,556] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=8 file=checkpoints/global_step30/layer_08-model_00-model_states.pt [2021-12-16 16:00:14,566] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=9 file=checkpoints/global_step30/layer_09-model_00-model_states.pt [2021-12-16 16:00:14,575] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=10 file=checkpoints/global_step30/layer_10-model_00-model_states.pt [2021-12-16 16:00:14,584] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=11 file=checkpoints/global_step30/layer_11-model_00-model_states.pt [2021-12-16 16:00:14,593] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=12 file=checkpoints/global_step30/layer_12-model_00-model_states.pt [2021-12-16 16:00:14,602] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=13 file=checkpoints/global_step30/layer_13-model_00-model_states.pt [2021-12-16 16:00:14,602] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=15 file=checkpoints/global_step30/layer_15-model_00-model_states.pt [2021-12-16 16:00:14,677] [INFO] [module.py:576:load_state_dir] RANK=0 Loaded layer=16 file=checkpoints/global_step30/layer_16-model_00-model_states.pt > validated currently set args with arguments in the checkpoint ... 
successfully loaded checkpoints/global_step30/mp_rank_00_model_states.pt Loading checkpoint and starting from iteration 30 > building train, validation, and test datasets ... reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... > dataset split: train: document indices in [0, 491014) total of 491014 documents validation: document indices in [491014, 516884) total of 25870 documents test: document indices in [516884, 517401) total of 517 documents > loading doc-idx mapping from data/enron/enron_text_document_train_indexmap_5120000ns_2048sl_1234s_doc_idx.npy > loading sample-idx mapping from data/enron/enron_text_document_train_indexmap_5120000ns_2048sl_1234s_sample_idx.npy > loading shuffle-idx mapping from data/enron/enron_text_document_train_indexmap_5120000ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.002 seconds total number of samples: 5172595 total number of epochs: 38 WARNING: shuffle index length (5172593) is not equal to sample index length (5172594) WARNING: shuffle index length (5172593) is not equal to sample index length (5172594) WARNING: shuffle index length (5172593) is not equal to sample index length (5172594)WARNING: shuffle index length (5172593) is not equal to sample index length (5172594) |
90,000 steps is a lot for a dataset as small as Enron. You're doing multiple epochs and the model is diverging. I recommend using weight decay.

Perhaps this was a bad idea, but the quick start was not particularly intended to be run as-is for all 320,000 steps. That would be nearly 10 epochs on the Enron data, which is quite excessive and unlikely to produce a particularly good model. We used Enron in the demo because it's small and can therefore be downloaded and tokenized quickly, not because it's a good starter dataset.

What would you like to see in the quick start guide? Any advice on how to keep others from falling into this pitfall would be appreciated.
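For anyone else hitting this, a minimal sketch of the two corresponding changes in `configs/small.yml` — the exact numbers here are illustrative, not tuned values:

```yaml
# regularization - a non-zero weight decay keeps the weights from growing unbounded
"weight-decay": 0.1,

# misc. training settings - a much shorter run, so the tiny Enron set
# isn't repeated for an excessive number of epochs
"train-iters": 10000,
"lr-decay-iters": 10000,
```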
This is not a bug - this is completely intended behaviour. The configs in this repo are not necessarily ready to go; depending on your dataset / model configuration you'll need to do some tuning, as they're just there for example purposes. I suggest you have a read of how mixed precision training works. Stella made some good points; another thing you can try is to change the dynamic loss scaling parameters.
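For reference, the dynamic loss scaling parameters live in the fp16 block of small.yml. A rough sketch of what you could experiment with — the values are only examples, and I'm assuming `initial_scale_power` is passed through to DeepSpeed (it isn't in the stock config):

```yaml
# precision settings - these knobs control DeepSpeed's dynamic loss scaling
"fp16": {
  "enabled": true,
  "loss_scale": 0,               # 0 means dynamic loss scaling
  "initial_scale_power": 16,     # assumption: start the scale at 2^16 instead of 2^32
  "loss_scale_window": 1000,     # overflow-free steps before the scale is raised again
  "hysteresis": 2,               # consecutive overflows tolerated before lowering the scale
  "min_loss_scale": 1
},
```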
@StellaAthena @pwstegman Thanks for your helpful feedback. The program is currently running with a weight decay of 0.15. So far there are no errors; let's see if it terminates.

@StellaAthena You asked what I would like to see in the quick start. What would be helpful is a pre-created config file that works well with the given dataset and is referenced in the quick start guide. Basically, design the guide so that it can be followed blindly to train a first model.
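To make that concrete, here is a rough sketch of the values such a file could change relative to small.yml — the name `quickstart_enron.yml` and all numbers are just my own guesses, not something that exists in the repo:

```yaml
# quickstart_enron.yml - hypothetical quick-start config (untested values);
# only the settings that would differ from small.yml are shown
{
  # regularization, so the weights don't blow up over repeated epochs
  "weight-decay": 0.15,

  # a much shorter schedule, so the tiny Enron set isn't looped dozens of times
  "train-iters": 20000,
  "lr-decay-iters": 20000,
  "save-interval": 5000,
}
```

The quick start could then point directly at it, e.g. `python ./deepy.py train.py -d configs quickstart_enron.yml local_setup.yml` (assuming the file also carries over the model settings from small.yml).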
Describe the bug
Hi, I'm getting an error trying to follow the "Quick Start" guide (https://github.com/EleutherAI/gpt-neox#quick-start):
Skipped 50 iterations in a row due to Overflow - Exiting training.
The error occurs right after checkpoint global_step90000 is created. This is the traceback:
Traceback (most recent call last):
File "train.py", line 27, in
pretrain(neox_args=neox_args)
File "~/gpt-neox/megatron/training.py", line 103, in pretrain
iteration = train(
File "~/gpt-neox/megatron/training.py", line 562, in train
overflow_monitor.check(skipped_iter) # check for repeated overflow
File "~/gpt-neox/megatron/utils.py", line 353, in check
raise Exception(
Exception: Skipped 50 iterations in a row due to Overflow - Exiting training.
To Reproduce
I'm using an Anaconda environment with Python 3.8.12 and PyTorch 1.10.0 (configured for GPU).
I executed the following two commands from the repository root:
pip install -r requirements/requirements.txt
python ./megatron/fused_kernels/setup.py install
Then I downloaded preconfigured datasets by executing:
python ./prepare_data.py
Afterwards I got the Enron data by running:
python prepare_data.py enron -t CharLevelTokenizer -d ./data/
Finally, I ran the pretraining module as:
python ./deepy.py train.py -d configs small.yml local_setup.yml
I didn't make any changes to the pre-existing config files.
Could somebody tell me what I'm doing wrong? Thanks in advance!