Distributed training with model parallelism hangs with the recent PR #985

Closed · absol13 opened this issue Jul 4, 2023 · 7 comments
Labels: bug (Something isn't working)

absol13 commented Jul 4, 2023

Describe the bug
Hello, I found that distributed training hangs with the setting "model-parallel-size" > 1. This started after PR #958 was merged; it did not occur with older sources, or with "model-parallel-size": 1 at all.

To Reproduce
I share my config file below for reproduction.

Expected behavior
Training should proceed further.

Screenshots
Specifically, training does not proceed at this point:

gpt-neox-train-dist-master: time (ms) | model and optimizer: 49512.12 | train/valid/test data iterators: 1488.60
gpt-neox-train-dist-master: training ...
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:553:forward] Activation Checkpointing Information
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:554:forward] ----Partition Activations True, CPU CHECKPOINTING False
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:557:forward] ----contiguous Memory Checkpointing False with 32 total layers
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:560:forward] ----Synchronization True
gpt-neox-train-dist-master: [2023-07-04 15:47:30,793] [INFO] [checkpointing.py:561:forward] ----Profiling time in checkpointing False

I also include the output of the nvidia-smi command:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    95W / 400W |  15699MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   37C    P0    91W / 400W |  15247MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:48:00.0 Off |                    0 |
| N/A   37C    P0    99W / 400W |  15723MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:4C:00.0 Off |                    0 |
| N/A   39C    P0    97W / 400W |  15259MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:88:00.0 Off |                    0 |
| N/A   36C    P0    93W / 400W |  15723MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:8B:00.0 Off |                    0 |
| N/A   40C    P0    93W / 400W |  15259MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C8:00.0 Off |                    0 |
| N/A   39C    P0   105W / 400W |  15699MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CB:00.0 Off |                    0 |
| N/A   37C    P0    90W / 400W |  15247MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

During normal training, GPU power usage comes close to its capacity, but here it stays far below it, which suggests the processes are hanging rather than computing.
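
For reference, one way to confirm that the ranks are blocked rather than computing is to dump the Python stack of each training process. The snippet below is a minimal, generic sketch and not part of gpt-neox or of this report: it assumes it is added near the top of the training entry point on every rank, after which sending SIGUSR1 to a stuck process prints the stack of every thread and shows which collective call it is waiting in.

# Generic diagnostic sketch, not gpt-neox code: dump all thread stacks on SIGUSR1.
import faulthandler
import signal

# After this call, `kill -USR1 <pid>` on a hung rank prints every thread's Python
# stack to stderr, typically pointing at the blocked torch.distributed call.
faulthandler.register(signal.SIGUSR1, all_threads=True)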

Environment (please complete the following information):

  • GPUs: 2 nodes, each with 8 NVIDIA A100 GPUs with NVLink, connected via InfiniBand.
  • Configs: my training config file is shared below to help reproduce this situation.
{

  "hostfile": "---",
  # Tokenizer /  checkpoint settings - you will need to change these to the location you have them saved in
  "vocab-file": "---/gptneox/tokenizer.json",
  "save": "./ckpnt",
  "load": "./ckpnt",

  # If finetuning, edit the following to the location of your finetuning dataset:
  "data-path": "---/dataset/pile_00/pile_00_text_document",

  # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
  # across the node boundaries )
  "pipe-parallel-size": 1,
  "model-parallel-size": 2, 
  #"make_vocab_size_divisible_by": 1,

  # model settings
  "num-layers": 32,
  "hidden-size": 4096,
  "num-attention-heads": 32,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "norm": "rmsnorm",
  "rms_norm_epsilon": 1.0e-6,
  "pos-emb": "alibi",
  "no-weight-tying": true,
  "attention-config": [[["flash"], 32]],
  "gpt_j_residual": false,
  "output_layer_parallelism": "column",

  "scaled-upper-triang-masked-softmax-fusion": true,
  "bias-gelu-fusion": false,
  #"use_bias_in_norms": false,
  #"use_bias_in_attn_linear": false,
  #"mlp_type": "llama",
  #"activation": "silu",

  # init methods
  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  # optimizer settings
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00012,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8,
      }
  },

  "min_lr": 0.00012,

  # for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
  "zero_optimization": {
  "stage": 1,
  "allgather_partitions": True,
  "allgather_bucket_size": 1260000000,
  "overlap_comm": True,
  "reduce_scatter": True,
  "reduce_bucket_size": 1260000000,
  "contiguous_gradients": True,
  },

  # batch / data settings (assuming 96 GPUs)
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "data-impl": "mmap",
  "split": "995,4,1",

  # activation checkpointing
  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  # regularization
  "gradient_clipping": 1.0,
  "weight-decay": 0.01,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  # precision settings
  "fp16": {
    "enabled": true,
    "type": "bfloat16",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "fp32_allreduce": True,

  # misc. training settings
  "train-iters": 150,
  "lr-decay-iters": 150000,

  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 500, # this variable previously called `save-interval`
  "eval-interval": 1000,
  "eval-iters": 10,

  # logging
  "log-interval": 1,
  "steps_per_print": 1,
  "keep-last-n-checkpoints": 1,


  ### NEW DATA: ####
  "tokenizer_type": "HFTokenizer",
  "tensorboard-dir": "./tensorboard",
  "log-dir": "./logs",


absol13 added the bug label on Jul 4, 2023
StellaAthena (Member) commented:

@honglu2875

honglu2875 (Contributor) commented Jul 4, 2023

@absol13 Would #979 fix the issue? It was my oversight that I didn't realize data could be None. It is possible that it hangs because some process errored out due to this bug.
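
(As a generic illustration of that failure mode, and explicitly not the actual #979 patch: if one rank raises on unguarded None data while its model-parallel peers enter a collective, the peers block forever, which presents as a hang rather than an error. A defensive pattern, assuming an initialized NCCL process group and a hypothetical helper name, could look like this.)

# Generic illustration only -- NOT the #979 patch. Hypothetical guard that makes all
# ranks in a model-parallel group agree on whether data is present before any of them
# enters a data-dependent collective. Assumes torch.distributed is initialized with
# the NCCL backend.
import torch
import torch.distributed as dist

def all_ranks_have_data(batch, group):
    have_data = torch.tensor([0 if batch is None else 1], dtype=torch.long, device="cuda")
    # MIN reduction: result is 1 only if every rank in the group has a batch.
    dist.all_reduce(have_data, op=dist.ReduceOp.MIN, group=group)
    return bool(have_data.item())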

absol13 (Author) commented Jul 5, 2023

@honglu2875 I don't know the exact reason. Anyway, I will try distributed training after #979 is merged and let you know whether this bug is solved. Thanks.

StellaAthena (Member) commented:

I have reproduced this issue and confirmed that #979 does not fix it.

honglu2875 (Contributor) commented Jul 6, 2023

I tried setting up neox again from an empty conda env and used my own config (pythia-1b), but made sure that

  • it was run on 2 full nodes
  • pp=1 and mp=2, zero stage 1

On the main branch it simply errored out on the first training step instead of hanging. With #979 applied, it trains normally (I watched it for about 10 training steps).

I will take a look at the other items in the OP's config later, but I hope this helps narrow down the problem.

StellaAthena (Member) commented:

After conferring with @honglu2875, we discovered that I had failed to apply the fix to both nodes. @absol13 It should now work for you on main.
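
(A quick generic way to catch this kind of multi-node skew is to compare the checked-out commit on every host. The sketch below is a hypothetical helper, not part of gpt-neox; it assumes passwordless SSH and a DeepSpeed-style hostfile with one "hostname slots=N" entry per line.)

# Hypothetical helper, not part of gpt-neox: print the commit checked out on each host.
import subprocess

def commits_per_host(hostfile="hostfile", repo="~/gpt-neox"):
    with open(hostfile) as f:
        hosts = [line.split()[0] for line in f if line.strip()]
    for host in hosts:
        result = subprocess.run(
            ["ssh", host, f"git -C {repo} rev-parse HEAD"],
            capture_output=True, text=True,
        )
        print(host, result.stdout.strip() or result.stderr.strip())

If the printed hashes differ, some node is still running the unpatched code.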

absol13 (Author) commented Jul 10, 2023

It works correctly now. Thank you for your fast support.

absol13 closed this as completed on Jul 10, 2023