Distributed training with model parallelism hangs with the recent PR #985

Closed · absol13 opened this issue Jul 4, 2023 · 7 comments
Labels: bug (Something isn't working)

absol13 commented Jul 4, 2023

Describe the bug
Hello, I found that distributed training hangs with the setting "model-parallel-size" > 1. This started after PR #958 was merged; it did not occur with older sources, or with "model-parallel-size": 1 at all.

To Reproduce
I share my config file below for reproduction.

Expected behavior
Training should proceed further.

Screenshots
Specifically, training does not proceed at this point:

gpt-neox-train-dist-master: time (ms) | model and optimizer: 49512.12 | train/valid/test data iterators: 1488.60
gpt-neox-train-dist-master: training ...
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:553:forward] Activation Checkpointing Information
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:554:forward] ----Partition Activations True, CPU CHECKPOINTING False
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:557:forward] ----contiguous Memory Checkpointing False with 32 total layers
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:560:forward] ----Synchronization True
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:561:forward] ----Profiling time in checkpointing False

I also include the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  15699MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    91W / 400W |  15247MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   37C    P0    99W / 400W |  15723MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4C:00.0 Off |                    0 |
| N/A   39C    P0    97W / 400W |  15259MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   36C    P0    93W / 400W |  15723MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   40C    P0    93W / 400W |  15259MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   39C    P0   105W / 400W |  15699MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   37C    P0    90W / 400W |  15247MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

During normal training, GPU power usage comes close to its capacity, but here it stays far below it, which suggests the processes are hanging rather than computing.
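
For reference, one way to confirm that the ranks are blocked rather than computing is to dump the Python stack of each training process. The snippet below is a minimal, generic sketch and not part of gpt-neox or of this report: it assumes it is added near the top of the training entry point on every rank, after which sending SIGUSR1 to a stuck process prints the stack of every thread and shows which collective call it is waiting in.

# Generic diagnostic sketch, not gpt-neox code: dump all thread stacks on SIGUSR1.
import faulthandler
import signal

# After this call, `kill -USR1 <pid>` on a hung rank prints every thread's Python
# stack to stderr, typically pointing at the blocked torch.distributed call.
faulthandler.register(signal.SIGUSR1, all_threads=True)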

Environment (please complete the following information):

  • GPUs: 2 nodes, each with 8 NVIDIA A100 GPUs with NVLink, connected via InfiniBand.
  • Configs: my training config file is shared below to help reproduce this situation.
{

  "hostfile": "---",
  # Tokenizer /  checkpoint settings - you will need to change these to the location you have them saved in
  "vocab-file": "---/gptneox/tokenizer.json",
  "save": "./ckpnt",
  "load": "./ckpnt",

  # If finetuning, edit the following to the location of your finetuning dataset:
  "data-path": "---/dataset/pile_00/pile_00_text_document",

  # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
  # across the node boundaries )
  "pipe-parallel-size": 1,
  "model-parallel-size": 2, 
  #"make_vocab_size_divisible_by": 1,

  # model settings
  "num-layers": 32,
  "hidden-size": 4096,
  "num-attention-heads": 32,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "norm": "rmsnorm",
  "rms_norm_epsilon": 1.0e-6,
  "pos-emb": "alibi",
  "no-weight-tying": true,
  "attention-config": [[["flash"], 32]],
  "gpt_j_residual": false,
  "output_layer_parallelism": "column",

  "scaled-upper-triang-masked-softmax-fusion": true,
  "bias-gelu-fusion": false,
  #"use_bias_in_norms": false,
  #"use_bias_in_attn_linear": false,
  #"mlp_type": "llama",
  #"activation": "silu",

  # init methods
  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  # optimizer settings
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00012,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8,
      }
  },

  "min_lr": 0.00012,

  # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
  "zero_optimization": {
  "stage": 1,
  "allgather_partitions": True,
  "allgather_bucket_size": 1260000000,
  "overlap_comm": True,
  "reduce_scatter": True,
  "reduce_bucket_size": 1260000000,
  "contiguous_gradients": True,
  },

  # batch / data settings (assuming 96 GPUs)
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "data-impl": "mmap",
  "split": "995,4,1",

  # activation checkpointing
  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  # regularization
  "gradient_clipping": 1.0,
  "weight-decay": 0.01,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  # precision settings
  "fp16": {
    "enabled": true,
    "type": "bfloat16",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "fp32_allreduce": True,

  # misc. training settings
  "train-iters": 150,
  "lr-decay-iters": 150000,

  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 500, # this variable previously called `save-interval`
  "eval-interval": 1000,
  "eval-iters": 10,

  # logging
  "log-interval": 1,
  "steps_per_print": 1,
  "keep-last-n-checkpoints": 1,


  ### NEW DATA: ####
  "tokenizer_type": "HFTokenizer",
  "tensorboard-dir": "./tensorboard",
  "log-dir": "./logs",


absol13 added the bug label on Jul 4, 2023
StellaAthena (Member) commented:

@honglu2875

honglu2875 (Contributor) commented Jul 4, 2023

@absol13 Would #979 fix the issue? It was my oversight that I didn't realize data could be None. It is possible that it hangs because some process errored out due to this bug.
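
(As a generic illustration of that failure mode, and explicitly not the actual #979 patch: if one rank raises on unguarded None data while its model-parallel peers enter a collective, the peers block forever, which presents as a hang rather than an error. A defensive pattern, assuming an initialized NCCL process group and a hypothetical helper name, could look like this.)

# Generic illustration only -- NOT the #979 patch. Hypothetical guard that makes all
# ranks in a model-parallel group agree on whether data is present before any of them
# enters a data-dependent collective. Assumes torch.distributed is initialized with
# the NCCL backend.
import torch
import torch.distributed as dist

def all_ranks_have_data(batch, group):
    have_data = torch.tensor([0 if batch is None else 1], dtype=torch.long, device="cuda")
    # MIN reduction: result is 1 only if every rank in the group has a batch.
    dist.all_reduce(have_data, op=dist.ReduceOp.MIN, group=group)
    return bool(have_data.item())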

absol13 (Author) commented Jul 5, 2023

@honglu2875 I don't know the exact reason. Anyway, I will try distributed training after #979 is merged and let you know whether this bug is solved. Thanks.

StellaAthena (Member) commented:

I have reproduced this issue and confirmed that #979 does not fix it.

honglu2875 (Contributor) commented Jul 6, 2023

I tried setting up neox again from an empty conda env and used my own config (pythia-1b), but made sure that

  • it was run on 2 full nodes
  • pp=1 and mp=2, zero stage 1

On the main branch it simply errored out on the first training step instead of hanging. With #979 applied, it trains normally (I watched it for about 10 training steps).

I will take a look at the other items in the OP's config later, but I hope this helps narrow down the problem.

StellaAthena (Member) commented:

After conferring with @honglu2875, we discovered that I had failed to apply the fix to both nodes. @absol13 It should now work for you on main.
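
(A quick generic way to catch this kind of multi-node skew is to compare the checked-out commit on every host. The sketch below is a hypothetical helper, not part of gpt-neox; it assumes passwordless SSH and a DeepSpeed-style hostfile with one "hostname slots=N" entry per line.)

# Hypothetical helper, not part of gpt-neox: print the commit checked out on each host.
import subprocess

def commits_per_host(hostfile="hostfile", repo="~/gpt-neox"):
    with open(hostfile) as f:
        hosts = [line.split()[0] for line in f if line.strip()]
    for host in hosts:
        result = subprocess.run(
            ["ssh", host, f"git -C {repo} rev-parse HEAD"],
            capture_output=True, text=True,
        )
        print(host, result.stdout.strip() or result.stderr.strip())

If the printed hashes differ, some node is still running the unpatched code.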

absol13 (Author) commented Jul 10, 2023

It works correctly now. Thank you for your fast support.

absol13 closed this as completed on Jul 10, 2023