[Training followed tutorial] error: exits with return code = -7 #940

Closed
phamkhactu opened this issue May 17, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@phamkhactu

Thanks for the excellent repo.

I followed the tutorial to train a model, but I get this error:

[2023-05-17 11:25:12,956] [INFO] [engine.py:88:__init__] CONFIG: micro_batches=1 micro_batch_size=1
[2023-05-17 11:25:13,053] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2420
[2023-05-17 11:25:13,146] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2421
[2023-05-17 11:25:13,146] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'train.py', '--local_rank=1', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogMiwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDEsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk5OV0sICJlcHMiOiAxZS0wOH19LCAiZnAzMl9hbGxyZWR1Y2UiOiB0cnVlLCAiZnAxNiI6IHsiZW5hYmxlZCI6IHRydWUsICJ0eXBlIjogImJmbG9hdDE2IiwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMCwgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ==', '--megatron_config', '/workspace/megatron_config.json'] exits with return code = -7
root@f3bc749fc97e:/workspace# python ./deepy.py train.py -d configs bf16_125M.yml local_setup.yml

My steps

  1. pip install -r requirements/requirements.txt, then python ./megatron/fused_kernels/setup.py install
  2. python prepare_data.py -d ./data
  3. python ./deepy.py train.py -d configs bf16_125M.yml local_setup.yml

Environment:

  • GPUs: 2 × RTX 3090 (24 GB each)
  • Configs:

Here are my full logs:

  data_impl ....................... mmap........................updated
  data_path ....................... data/enwik8/enwik8_text_documentupdated
  dynamic_loss_scale .............. True........................updated
  eval_iters ...................... 10..........................updated
  fp16 ............................ {'enabled': True, 'type': 'bfloat16', 'loss_scale': 0, 'loss_scale_window': 1000, 'hysteresis': 2, 'min_loss_scale': 1}updated
  fp32_allreduce .................. True........................updated
  gas ............................. 1...........................updated
  global_num_gpus ................. 2...........................updated
  hidden_dropout .................. 0.0.........................updated
  hidden_size ..................... 768.........................updated
  is_pipe_parallel ................ True........................updated
  keep_last_n_checkpoints ......... 4...........................updated
  load ............................ checkpoints.................updated
  log_dir ......................... logs........................updated
  lr .............................. 0.0006......................updated
  lr_decay_iters .................. 320000......................updated
  lr_decay_style .................. cosine......................updated
  max_position_embeddings ......... 2048........................updated
  merge_file ...................... data/gpt2-merges.txt........updated
  no_weight_tying ................. True........................updated
  num_attention_heads ............. 12..........................updated
  num_layers ...................... 12..........................updated
  optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}}updated
  optimizer_type .................. Adam........................updated
  partition_activations ........... True........................updated
  pipe_parallel_size .............. 1...........................updated
  pos_emb ......................... rotary......................updated
  precision ....................... fp16........................updated
  save ............................ checkpoints.................updated
  save_iters ...................... [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 150000, 160000, 170000, 180000, 190000, 200000, 210000, 220000, 230000, 240000, 250000, 260000, 270000, 280000, 290000, 300000, 310000]updated
  seq_length ...................... 2048........................updated
  sparsity_config ................. {}..........................updated
  split ........................... 949,50,1....................updated
  synchronize_each_layer .......... True........................updated
  tensorboard_dir ................. tensorboard.................updated
  text_gen_type ................... unconditional...............updated
  train_batch_size ................ 2...........................updated
  train_iters ..................... 320000......................updated
  train_micro_batch_size_per_gpu .. 1...........................updated
  use_wandb ....................... True........................updated
  user_script ..................... train.py....................updated
  vocab_file ...................... data/gpt2-vocab.json........updated
  wall_clock_breakdown ............ True........................updated
  wandb_group ..................... u3oxfjlt_zz8q3x3f...........updated
  weight_decay .................... 0.0.........................updated
  zero_allgather_bucket_size ...... 500000000...................updated
  zero_contiguous_gradients ....... True........................updated
  zero_optimization ............... {'stage': 0, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True}updated
  zero_reduce_bucket_size ......... 500000000...................updated
  zero_reduce_scatter ............. True........................updated
  zero_stage ...................... 0...........................updated
  activation ...................... gelu........................default
  activation_checkpointing ........ None........................default
  adlr_autoresume ................. False.......................default
  adlr_autoresume_interval ........ 1000........................default
  amp ............................. None........................default
  apply_query_key_layer_scaling ... False.......................default
  attention_softmax_in_fp32 ....... False.......................default
  autotuning ...................... None........................default
  autotuning_run .................. None........................default
  base_shapes_file ................ None........................default
  bf16 ............................ None........................default
  bias_dropout_fusion ............. False.......................default
  bias_gelu_fusion ................ False.......................default
  char_level_ppl .................. False.......................default
  checkpoint ...................... None........................default
  checkpoint_in_cpu ............... False.......................default
  checkpoint_num_layers ........... 1...........................default
  checkpoint_scale ................ linear......................default
  checkpoint_validation_with_forward_pass  False................default
  comment ......................... None........................default
  comms_logger .................... None........................default
  communication_data_type ......... None........................default
  compression_training ............ None........................default
  contiguous_checkpointing ........ False.......................default
  coord_check ..................... False.......................default
  csv_monitor ..................... None........................default
  curriculum_learning ............. None........................default
  curriculum_seqlen ............... 0...........................default
  data_efficiency ................. None........................default
  data_types ...................... None........................default
  deepscale ....................... False.......................default
  deepscale_config ................ None........................default
  deepspeed ....................... True........................default
  deepspeed_activation_checkpointing  True......................default
  deepspeed_extra_args ............ None........................default
  deepspeed_mpi ................... False.......................default
  deepspeed_slurm ................. False.......................default
  detect_nvlink_pairs ............. False.......................default
  distributed_backend ............. nccl........................default
  do_test ......................... None........................default
  do_train ........................ None........................default
  do_valid ........................ None........................default
  dump_state ...................... False.......................default
  elasticity ...................... None........................default
  eod_mask_loss ................... False.......................default
  eval_interval ................... 1000........................default
  eval_results_prefix ............. ............................default
  eval_tasks ...................... None........................default
  exclude ......................... None........................default
  exit_interval ................... None........................default
  extra_save_iters ................ None........................default
  finetune ........................ False.......................default
  flops_profiler .................. None........................default
  force_multi ..................... False.......................default
  fp16_lm_cross_entropy ........... False.......................default
  git_hash ........................ None........................default
  gmlp_attn_dim ................... 64..........................default
  gpt_j_residual .................. False.......................default
  gpt_j_tied ...................... False.......................default
  gradient_accumulation_steps ..... 1...........................default
  gradient_clipping ............... 1.0.........................default
  gradient_noise_scale_cpu_offload  False.......................default
  gradient_noise_scale_n_batches .. 5...........................default
  gradient_predivide_factor ....... 1.0.........................default
  hostfile ........................ None........................default
  hysteresis ...................... 2...........................default
  include ......................... None........................default
  init_method ..................... normal......................default
  init_method_std ................. 0.02........................default
  iteration ....................... None........................default
  launcher ........................ pdsh........................default
  layernorm_epsilon ............... 1e-05.......................default
  lazy_mpu_init ................... False.......................default
  local_rank ...................... None........................default
  log_grad_norm ................... False.......................default
  log_grad_pct_zeros .............. False.......................default
  log_gradient_noise_scale ........ False.......................default
  log_interval .................... 100.........................default
  log_optimizer_states ............ False.......................default
  log_param_norm .................. False.......................default
  loss_scale ...................... None........................default
  loss_scale_window ............... 1000.0......................default
  make_vocab_size_divisible_by .... 128.........................default
  master_addr ..................... None........................default
  master_port ..................... 29500.......................default
  maximum_tokens .................. 64..........................default
  min_lr .......................... 0.0.........................default
  min_scale ....................... 1.0.........................default
  mlp_type ........................ regular.....................default
  mmap_warmup ..................... False.......................default
  model_parallel_size ............. 1...........................default
  mup_attn_temp ................... 1.0.........................default
  mup_embedding_mult .............. 1.0.........................default
  mup_init_scale .................. 1.0.........................default
  mup_output_temp ................. 1.0.........................default
  mup_rp_embedding_mult ........... 1.0.........................default
  mup_width_scale ................. 2...........................default
  no_load_optim ................... False.......................default
  no_load_rng ..................... False.......................default
  no_save_optim ................... False.......................default
  no_save_rng ..................... False.......................default
  no_ssh_check .................... False.......................default
  norm ............................ layernorm...................default
  num_gpus ........................ None........................default
  num_nodes ....................... -1..........................default
  num_samples ..................... 1...........................default
  num_unique_layers ............... None........................default
  num_workers ..................... 2...........................default
  onnx_safe ....................... False.......................default
  opt_pos_emb_offset .............. 0...........................default
  output_layer_init_method ........ scaled_normal...............default
  output_layer_parallelism ........ column......................default
  override_lr_scheduler ........... False.......................default
  padded_vocab_size ............... None........................default
  param_sharing_style ............. grouped.....................default
  pipe_partition_method ........... type:transformer|mlp........default
  prescale_gradients .............. False.......................default
  profile_backward ................ False.......................default
  prompt_end ...................... \n..........................default
  rank ............................ None........................default
  recompute ....................... False.......................default
  return_logits ................... False.......................default
  rms_norm_epsilon ................ 1e-08.......................default
  rotary_emb_base ................. 10000.......................default
  rotary_pct ...................... 1.0.........................default
  rpe_max_distance ................ 128.........................default
  rpe_num_buckets ................. 32..........................default
  sample_input_file ............... None........................default
  sample_output_file .............. samples.txt.................default
  save_base_shapes ................ False.......................default
  scaled_masked_softmax_fusion .... False.......................default
  scaled_upper_triang_masked_softmax_fusion  False..............default
  scalenorm_epsilon ............... 1e-08.......................default
  scheduler ....................... None........................default
  seed ............................ 1234........................default
  short_seq_prob .................. 0.1.........................default
  soft_prompt_tuning .............. None........................default
  sparse_attention ................ None........................default
  sparse_gradients ................ False.......................default
  steps_per_print ................. 10..........................default
  temperature ..................... 0.0.........................default
  tensorboard ..................... None........................default
  test_data_paths ................. None........................default
  test_data_weights ............... None........................default
  tokenizer_type .................. GPT2BPETokenizer............default
  top_k ........................... 0...........................default
  top_p ........................... 0.0.........................default
  train_data_paths ................ None........................default
  train_data_weights .............. None........................default
  use_bias_in_attn_linear ......... True........................default
  use_bias_in_norms ............... True........................default
  use_bnb_optimizer ............... False.......................default
  use_checkpoint_lr_scheduler ..... False.......................default
  use_cpu_initialization .......... False.......................default
  use_mup ......................... False.......................default
  use_shared_fs ................... True........................default
  valid_data_paths ................ None........................default
  valid_data_weights .............. None........................default
  wandb ........................... None........................default
  wandb_host ...................... https://api.wandb.ai........default
  wandb_init_all_ranks ............ False.......................default
  wandb_project ................... neox........................default
  wandb_team ...................... None........................default
  warmup .......................... 0.01........................default
  weight_by_num_documents ......... False.......................default
  weighted_sampler_alpha .......... 0.3.........................default
  world_size ...................... None........................default
---------------- end of arguments ----------------
NeoXArgs.configure_distributed_args() using world size: 1 and model-parallel size: 1 
[2023-05-17 11:25:07,074] [WARNING] [runner.py:193:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-17 11:25:07,074] [INFO] [runner.py:559:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed_config eyJ0cmFpbl9iYXRjaF9zaXplIjogMiwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDEsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk5OV0sICJlcHMiOiAxZS0wOH19LCAiZnAzMl9hbGxyZWR1Y2UiOiB0cnVlLCAiZnAxNiI6IHsiZW5hYmxlZCI6IHRydWUsICJ0eXBlIjogImJmbG9hdDE2IiwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMCwgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ== --megatron_config /workspace/megatron_config.json
[2023-05-17 11:25:08,041] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.12.10-1+cuda11.6
[2023-05-17 11:25:08,041] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-05-17 11:25:08,041] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10-1
[2023-05-17 11:25:08,041] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-17 11:25:08,041] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.12.10-1+cuda11.6
[2023-05-17 11:25:08,041] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-17 11:25:08,041] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-05-17 11:25:08,041] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-05-17 11:25:08,041] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-05-17 11:25:08,041] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-05-17 11:25:08,041] [INFO] [launch.py:162:main] dist_world_size=2
[2023-05-17 11:25:08,041] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
fatal: detected dubious ownership in repository at '/workspace'
To add an exception for this directory, call:

        git config --global --add safe.directory /workspace
fatal: detected dubious ownership in repository at '/workspace'
To add an exception for this directory, call:

        git config --global --add safe.directory /workspace
NeoXArgs.configure_distributed_args() using world size: 2 and model-parallel size: 1 
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later and do you have tensorboard installed?), no TensorBoard logs will be written.
> initializing torch distributed ...
[2023-05-17 11:25:10,522] [INFO] [comm.py:661:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
> initializing model parallel with size 1
MPU DP: [0, 1]
MPU PP: [0]
MPU PP: [1]
MPU MP: [0]
MPU MP: [1]
> setting random seeds to 1234 ...
[2023-05-17 11:25:11,562] [INFO] [checkpointing.py:227:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/workspace/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1}
[2023-05-17 11:25:12,323] [INFO] [module.py:372:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
     0: EmbeddingPipe
     1: _pre_transformer_block
     2: ParallelTransformerLayerPipe
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: _post_transformer_block
    15: NormPipe
    16: ParallelLinearPipe
  loss: partial
WARNING: APEX not installed - defaulting to deepspeed's fused adam
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
WARNING: APEX not installed - defaulting to deepspeed's fused adam
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.11235284805297852 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10126042366027832 seconds
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-05-17 11:25:12,481] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed info: version=0.8.3+5317ca6, git-hash=5317ca6, git-branch=main
[2023-05-17 11:25:12,740] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-05-17 11:25:12,741] [INFO] [logging.py:77:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-05-17 11:25:12,741] [INFO] [logging.py:77:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-05-17 11:25:12,744] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-05-17 11:25:12,744] [INFO] [logging.py:77:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2023-05-17 11:25:12,751] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-05-17 11:25:12,751] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-05-17 11:25:12,751] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7f4a237ec3a0>
[2023-05-17 11:25:12,751] [INFO] [logging.py:77:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2023-05-17 11:25:12,751] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   amp_enabled .................. False
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   amp_params ................... False
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   bfloat16_enabled ............. False
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   checkpoint_parallel_write_pipeline  False
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   checkpoint_tag_validation_enabled  True
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   checkpoint_tag_validation_fail  False
[2023-05-17 11:25:12,752] [INFO] [config.py:1022:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f4a243e3580>
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   communication_data_type ...... None
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   curriculum_enabled_legacy .... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   curriculum_params_legacy ..... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   data_efficiency_enabled ...... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   dataloader_drop_last ......... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   disable_allgather ............ False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   dump_state ................... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   eigenvalue_enabled ........... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   eigenvalue_gas_boundary_resolution  1
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   eigenvalue_layer_num ......... 0
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   eigenvalue_max_iter .......... 100
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   eigenvalue_stability ......... 1e-06
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   eigenvalue_tol ............... 0.01
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   eigenvalue_verbose ........... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   elasticity_enabled ........... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   fp16_auto_cast ............... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   fp16_enabled ................. True
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   fp16_master_weights_and_gradients  False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   global_rank .................. 0
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   grad_accum_dtype ............. None
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   gradient_accumulation_steps .. 1
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   gradient_clipping ............ 0.0
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   gradient_predivide_factor .... 1.0
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   initial_dynamic_scale ........ 65536
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   load_universal_checkpoint .... False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   loss_scale ................... 0
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   memory_breakdown ............. False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-05-17 11:25:12,753] [INFO] [config.py:1022:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   optimizer_legacy_fusion ...... False
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   optimizer_name ............... adam
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   optimizer_params ............. {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   pld_enabled .................. False
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   pld_params ................... False
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   prescale_gradients ........... False
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   scheduler_name ............... None
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   scheduler_params ............. None
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   sparse_attention ............. None
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   sparse_gradients_enabled ..... False
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   steps_per_print .............. 10
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   train_batch_size ............. 2
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   train_micro_batch_size_per_gpu  1
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   use_node_local_storage ....... False
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   wall_clock_breakdown ......... True
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   world_size ................... 2
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   zero_allow_untested_optimizer  False
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   zero_enabled ................. False
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   zero_force_ds_cpu_optimizer .. True
[2023-05-17 11:25:12,754] [INFO] [config.py:1022:print]   zero_optimization_stage ...... 0
[2023-05-17 11:25:12,754] [INFO] [config.py:1007:print_user_config]   json = {
    "train_batch_size": 2, 
    "train_micro_batch_size_per_gpu": 1, 
    "optimizer": {
        "type": "Adam", 
        "params": {
            "lr": 0.0006, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08
        }
    }, 
    "fp32_allreduce": true, 
    "fp16": {
        "enabled": true, 
        "type": "bfloat16", 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "zero_optimization": {
        "stage": 0, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true
    }, 
    "wall_clock_breakdown": true
}
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.11551022529602051 seconds
Loading extension module utils...
Time to load utils op: 0.2013239860534668 seconds
[2023-05-17 11:25:12,956] [INFO] [engine.py:88:__init__] CONFIG: micro_batches=1 micro_batch_size=1
[2023-05-17 11:25:13,053] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2420
[2023-05-17 11:25:13,146] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 2421
[2023-05-17 11:25:13,146] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'train.py', '--local_rank=1', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogMiwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDEsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk5OV0sICJlcHMiOiAxZS0wOH19LCAiZnAzMl9hbGxyZWR1Y2UiOiB0cnVlLCAiZnAxNiI6IHsiZW5hYmxlZCI6IHRydWUsICJ0eXBlIjogImJmbG9hdDE2IiwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMCwgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ==', '--megatron_config', '/workspace/megatron_config.json'] exits with return code = -7
@phamkhactu phamkhactu added the bug Something isn't working label May 17, 2023
@StellaAthena
Member

The error appears to be

fatal: detected dubious ownership in repository at '/workspace'
To add an exception for this directory, call:

        git config --global --add safe.directory /workspace
fatal: detected dubious ownership in repository at '/workspace'
To add an exception for this directory, call:

        git config --global --add safe.directory /workspace

The reason this is an issue is detailed here. The quoted message tells you how to bypass the check, but if you are on a shared machine (e.g., a university cluster) you should not do so without thinking about it very carefully. The most likely underlying explanation is that the permissions on your machine are misconfigured.
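
For completeness, a minimal sketch of applying the exception inside the container and retrying, assuming the repo is mounted at /workspace as in the logs above (only do this if you trust the contents of that directory):

    # Tell git the mounted repo is safe despite the ownership mismatch
    git config --global --add safe.directory /workspace

    # then relaunch training
    python ./deepy.py train.py -d configs bf16_125M.yml local_setup.yml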

@phamkhactu
Author

phamkhactu commented May 18, 2023

Hi @StellaAthena
I've set the config as you mentioned, and the git error has disappeared, but I still get an error.

Here are my logs:

---------------- end of arguments ----------------
NeoXArgs.configure_distributed_args() using world size: 1 and model-parallel size: 1 
[2023-05-18 02:58:36,147] [WARNING] [runner.py:193:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-18 02:58:36,147] [INFO] [runner.py:559:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None train.py --deepspeed_config eyJ0cmFpbl9iYXRjaF9zaXplIjogMiwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDEsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk5OV0sICJlcHMiOiAxZS0wOH19LCAiZnAzMl9hbGxyZWR1Y2UiOiB0cnVlLCAiZnAxNiI6IHsiZW5hYmxlZCI6IHRydWUsICJ0eXBlIjogImJmbG9hdDE2IiwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMCwgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ== --megatron_config /workspace/megatron_config.json
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.12.10-1+cuda11.6
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10-1
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.12.10-1+cuda11.6
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-18 02:58:37,125] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-05-18 02:58:37,125] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-05-18 02:58:37,125] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-05-18 02:58:37,125] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-05-18 02:58:37,125] [INFO] [launch.py:162:main] dist_world_size=2
[2023-05-18 02:58:37,125] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
NeoXArgs.configure_distributed_args() using world size: 2 and model-parallel size: 1 
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later and do you have tensorboard installed?), no TensorBoard logs will be written.
> initializing torch distributed ...
[2023-05-18 02:58:39,546] [INFO] [comm.py:661:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
> initializing model parallel with size 1
MPU DP: [0, 1]
MPU PP: [0]
MPU PP: [1]
MPU MP: [0]
MPU MP: [1]
> setting random seeds to 1234 ...
[2023-05-18 02:58:39,642] [INFO] [checkpointing.py:227:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/workspace/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/megatron/data'
building GPT2 model ...
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1}
[2023-05-18 02:58:40,415] [INFO] [module.py:372:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
stage=0 layers=17
     0: EmbeddingPipe
     1: _pre_transformer_block
     2: ParallelTransformerLayerPipe
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: _post_transformer_block
    15: NormPipe
    16: ParallelLinearPipe
  loss: partial
WARNING: APEX not installed - defaulting to deepspeed's fused adam
Configuring Optimizer type: Adam with params: {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
WARNING: APEX not installed - defaulting to deepspeed's fused adam
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Installed CUDA version 11.6 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.11797428131103516 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10116434097290039 seconds
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-05-18 02:58:40,576] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed info: version=0.8.3+5317ca6, git-hash=5317ca6, git-branch=main
[2023-05-18 02:58:40,933] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-05-18 02:58:40,933] [INFO] [logging.py:77:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-05-18 02:58:40,933] [INFO] [logging.py:77:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-05-18 02:58:40,936] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-05-18 02:58:40,936] [INFO] [logging.py:77:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2023-05-18 02:58:40,944] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam
[2023-05-18 02:58:40,944] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-05-18 02:58:40,944] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7fb4615a02b0>
[2023-05-18 02:58:40,944] [INFO] [logging.py:77:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2023-05-18 02:58:40,945] [INFO] [config.py:1018:print] DeepSpeedEngine configuration:
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print]   amp_enabled .................. False
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print]   amp_params ................... False
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-05-18 02:58:40,945] [INFO] [config.py:1022:print]   bfloat16_enabled ............. False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   checkpoint_parallel_write_pipeline  False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   checkpoint_tag_validation_enabled  True
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   checkpoint_tag_validation_fail  False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fb462199520>
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   communication_data_type ...... None
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   curriculum_enabled_legacy .... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   curriculum_params_legacy ..... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   data_efficiency_enabled ...... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   dataloader_drop_last ......... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   disable_allgather ............ False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   dump_state ................... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   eigenvalue_enabled ........... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   eigenvalue_gas_boundary_resolution  1
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   eigenvalue_layer_num ......... 0
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   eigenvalue_max_iter .......... 100
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   eigenvalue_stability ......... 1e-06
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   eigenvalue_tol ............... 0.01
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   eigenvalue_verbose ........... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   elasticity_enabled ........... False
[2023-05-18 02:58:40,946] [INFO] [config.py:1022:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   fp16_auto_cast ............... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   fp16_enabled ................. True
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   fp16_master_weights_and_gradients  False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   global_rank .................. 0
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   grad_accum_dtype ............. None
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   gradient_accumulation_steps .. 1
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   gradient_clipping ............ 0.0
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   gradient_predivide_factor .... 1.0
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   initial_dynamic_scale ........ 65536
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   load_universal_checkpoint .... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   loss_scale ................... 0
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   memory_breakdown ............. False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   optimizer_legacy_fusion ...... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   optimizer_name ............... adam
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   optimizer_params ............. {'lr': 0.0006, 'betas': [0.9, 0.999], 'eps': 1e-08}
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   pld_enabled .................. False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   pld_params ................... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   prescale_gradients ........... False
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   scheduler_name ............... None
[2023-05-18 02:58:40,947] [INFO] [config.py:1022:print]   scheduler_params ............. None
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   sparse_attention ............. None
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   sparse_gradients_enabled ..... False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   steps_per_print .............. 10
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   train_batch_size ............. 2
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   train_micro_batch_size_per_gpu  1
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   use_node_local_storage ....... False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   wall_clock_breakdown ......... True
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   world_size ................... 2
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   zero_allow_untested_optimizer  False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   zero_enabled ................. False
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   zero_force_ds_cpu_optimizer .. True
[2023-05-18 02:58:40,948] [INFO] [config.py:1022:print]   zero_optimization_stage ...... 0
[2023-05-18 02:58:40,948] [INFO] [config.py:1007:print_user_config]   json = {
    "train_batch_size": 2, 
    "train_micro_batch_size_per_gpu": 1, 
    "optimizer": {
        "type": "Adam", 
        "params": {
            "lr": 0.0006, 
            "betas": [0.9, 0.999], 
            "eps": 1e-08
        }
    }, 
    "fp32_allreduce": true, 
    "fp16": {
        "enabled": true, 
        "type": "bfloat16", 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "zero_optimization": {
        "stage": 0, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true
    }, 
    "wall_clock_breakdown": true
}
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.12190628051757812 seconds
Loading extension module utils...
Time to load utils op: 0.2017836570739746 seconds
[2023-05-18 02:58:41,150] [INFO] [engine.py:88:__init__] CONFIG: micro_batches=1 micro_batch_size=1
[2023-05-18 02:58:41,171] [INFO] [engine.py:144:__init__] RANK=0 STAGE=0 LAYERS=17 [0, 17) STAGE_PARAMS=162322944 (162.323M) TOTAL_PARAMS=162322944 (162.323M) UNIQUE_PARAMS=162322944 (162.323M)
 > number of parameters on model parallel rank 0: 162322944
[2023-05-18 02:58:42,137] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 89
[2023-05-18 02:58:42,138] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90
[2023-05-18 02:58:42,140] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'train.py', '--local_rank=1', '--deepspeed_config', 'eyJ0cmFpbl9iYXRjaF9zaXplIjogMiwgInRyYWluX21pY3JvX2JhdGNoX3NpemVfcGVyX2dwdSI6IDEsICJvcHRpbWl6ZXIiOiB7InR5cGUiOiAiQWRhbSIsICJwYXJhbXMiOiB7ImxyIjogMC4wMDA2LCAiYmV0YXMiOiBbMC45LCAwLjk5OV0sICJlcHMiOiAxZS0wOH19LCAiZnAzMl9hbGxyZWR1Y2UiOiB0cnVlLCAiZnAxNiI6IHsiZW5hYmxlZCI6IHRydWUsICJ0eXBlIjogImJmbG9hdDE2IiwgImxvc3Nfc2NhbGUiOiAwLCAibG9zc19zY2FsZV93aW5kb3ciOiAxMDAwLCAiaHlzdGVyZXNpcyI6IDIsICJtaW5fbG9zc19zY2FsZSI6IDF9LCAiemVyb19vcHRpbWl6YXRpb24iOiB7InN0YWdlIjogMCwgImFsbGdhdGhlcl9wYXJ0aXRpb25zIjogdHJ1ZSwgImFsbGdhdGhlcl9idWNrZXRfc2l6ZSI6IDUwMDAwMDAwMCwgIm92ZXJsYXBfY29tbSI6IHRydWUsICJyZWR1Y2Vfc2NhdHRlciI6IHRydWUsICJyZWR1Y2VfYnVja2V0X3NpemUiOiA1MDAwMDAwMDAsICJjb250aWd1b3VzX2dyYWRpZW50cyI6IHRydWV9LCAid2FsbF9jbG9ja19icmVha2Rvd24iOiB0cnVlfQ==', '--megatron_config', '/workspace/megatron_config.json'] exits with return code = -7

@phamkhactu
Author

@StellaAthena Thanks for your support. I have found my problem.

@chaitanyamalaviya

Hi @phamkhactu , would you be able to share how you fixed this issue? I'm running into the same problems.

@phamkhactu
Author

Hi @phamkhactu , would you be able to share how you fixed this issue? I'm running into the same problems.

It means that some packages are not compatible with your environment. You should build the Docker image (or pull a shared image) and run inside it; that will fix it.
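
A rough sketch of that approach, assuming you build from the Dockerfile shipped in the repository root (the gpt-neox image tag and the /workspace mount are just examples, adjust to your setup):

    # Build an image with a known-good CUDA / PyTorch / DeepSpeed combination
    docker build -t gpt-neox .

    # Run it with both GPUs visible and the repo mounted at /workspace
    docker run --rm -it --gpus all -v "$(pwd)":/workspace gpt-neox bash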

@tranhd95

tranhd95 commented Jun 20, 2023

Hi @phamkhactu , would you be able to share how you fixed this issue? I'm running into the same problems.

Hi, sorry for bumping, but I had a similar error with the same return code and no detailed explanation. I was running GPT-NeoX in a Docker container on a local k8s cluster.

My solution was to increase the shm size of the container, as noted in the README and NCCL's docs. Cheers.
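
For anyone landing here later: a return code of -7 means the training subprocess was killed by signal 7 (SIGBUS on Linux), which is commonly triggered when /dev/shm is too small for NCCL's shared-memory transport or the dataloader workers. A minimal sketch for plain Docker (the --shm-size flag is standard; the 8g value and gpt-neox image name are placeholders, size it to your workload):

    # Give the container a larger /dev/shm before launching training
    docker run --rm -it --gpus all --shm-size=8g \
        -v "$(pwd)":/workspace gpt-neox \
        python ./deepy.py train.py -d configs bf16_125M.yml local_setup.yml

On Kubernetes the rough equivalent is mounting an emptyDir volume with medium: Memory at /dev/shm in the pod spec.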
