ZeRO 2 cpu_offload causes RuntimeError: expected input to be on cuda #478
Comments
I notice that you do not explicitly mention DeepSpeed or DeeperSpeed. GPT-NeoX relies on DeeperSpeed for ZeRO and the rest of its distributed-training features, so it is an important piece of the puzzle. DeeperSpeed should be installed from ./requirements/requirements.txt, so I would verify that the install completes correctly and test ZeRO 2 on some other code so we can be sure this is a GPT-NeoX problem rather than a DeeperSpeed problem.
I would strongly suggest using our prebuilt images or building your own image from the included Dockerfile; all of the critical dependencies will be preinstalled and ready to use. It is a lot easier to use our verified environment than to install the dependencies from scratch yourself.
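If it helps to isolate things, below is a rough standalone sketch (hypothetical; the filename, layer sizes, and hyperparameters are arbitrary, and it is not part of this repo) that exercises ZeRO 2 with cpu_offload on a plain feed-forward model, independent of GPT-NeoX. Launched with the deepspeed launcher (e.g. deepspeed test_zero2_offload.py), it should surface a pure DeepSpeed/DeeperSpeed failure if one exists:

# test_zero2_offload.py -- minimal ZeRO 2 + CPU offload smoke test (hypothetical example)
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1.0e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,  # deprecated key; newer releases spell this "offload_optimizer"
    },
}

# A plain feed-forward network, nothing GPT-NeoX specific
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)

# On older DeepSpeed versions, pass config_params=ds_config instead of config=ds_config
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for step in range(10):
    x = torch.randn(4, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()
    if engine.global_rank == 0:
        print(f"step {step}: loss {loss.item():.4f}")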
Thanks for the reply! I installed directly from the requirements.txt files, so DeeperSpeed was installed. I've trained using ZeRO 1 without issue; the error only occurs once I change "stage" to 2 and "cpu_offload" to True in the zero_optimization section of small.yml.
I gave the docker container referenced in the README a try and got the same issue:
git clone https://github.com/EleutherAI/gpt-neox.git
cd gpt-neox
nvidia-docker run --rm -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 --shm-size=1g --ulimit memlock=-1 --mount type=bind,src=$PWD,dst=/gpt-neox leogao2/gpt-neox:sha-63e10af
cd /gpt-neox
python prepare_data.py enron
# Confirm default config works
python deepy.py train.py -d configs/ small.yml local_setup.yml
# Got error saying to run the below command (sudo -s added by me to correct permission denied errors)
sudo -s
python /gpt-neox/megatron/fused_kernels/setup.py install
# Rerun the original command
python deepy.py train.py -d configs/ small.yml local_setup.yml
# Training ran without issue. CTRL+Ced out after 60 steps
# Enable ZeRO 2 cpu_offload
cp configs/small.yml configs/small_offload.yml
sed -i 's/"stage": 0/"stage": 2/g' configs/small_offload.yml
sed -i 's/"cpu_offload": False/"cpu_offload": True/g' configs/small_offload.yml
# Run with CPU offload
python deepy.py train.py -d configs/ small_offload.yml local_setup.yml
# The above command throws "RuntimeError: expected input to be on cuda"

Full Traceback (collapsed; key excerpts below, full argument dump omitted)
NeoXArgs.from_ymls() ['configs/small_offload.yml', 'configs/local_setup.yml']
INFO:root:NeoXArgs.calculate_derived() Total number of GPUs determined to be: 4
zero_optimization ............... {'stage': 2, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': True}
zero_stage ...................... 2
is_pipe_parallel ................ True
pipe_parallel_size .............. 1
model_parallel_size ............. 1
train_micro_batch_size_per_gpu .. 4
[2021-12-11 00:42:50,771] [INFO] [module.py:363:_partition_layers] Partitioning pipeline stages with method type:transformer|mlp
[2021-12-11 00:42:50,808] [WARNING] [config.py:77:_sanity_check] DeepSpeedConfig: cpu_offload is deprecated. Please use offload_optimizer.
[2021-12-11 00:42:50,810] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.15+eb7f5cf, git-hash=eb7f5cf, git-branch=main
[2021-12-11 00:42:52,186] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2021-12-11 00:42:52,186] [INFO] [stage2.py:104:__init__] CPU Offload: True
(remainder of the launcher and extension-build log truncated)

I don't have time right now, but in the next couple weeks I might try to make a simple feed-forward network, then I can wrap it in DeeperSpeed ZeRO 2 with CPU offload to try and isolate the issue to either GPT-NeoX or DeeperSpeed.
@pwstegman Pipeline parallelism and ZeRO 2 are incompatible, so if the model is being built as a (one-stage) pipeline module, that could cause the problem. The printout says it's a pipeline module, but off the top of my head I'm not 100% sure that we put the print statement in the right place relative to the cast to a non-pipeline module :P
Also, just to confirm: the configs you're describing are a minimal example or a test to make sure things work, right? This is a very poor way to actually train a 125M model on your hardware. You should not need CPU offload unless you're trying to train a model with at least 40B parameters.
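For context, a rough back-of-the-envelope estimate (assuming the usual mixed-precision Adam accounting of about 16 bytes per parameter: fp16 weights and gradients plus fp32 master weights, momentum, and variance) shows why offload is overkill at 125M parameters but becomes relevant at tens of billions:

# rough estimate; 16 bytes/param is the standard mixed-precision Adam accounting
def model_state_gb(n_params, bytes_per_param=16):
    return n_params * bytes_per_param / 1024**3

for n_params in (125e6, 6e9, 20e9, 40e9):
    print(f"{n_params / 1e9:6.3f}B params -> ~{model_state_gb(n_params):6.1f} GB of weights + optimizer state")
# ~1.9 GB for 125M (fits easily on one GPU); ~600 GB for 40B, which has to be
# partitioned across many GPUs and/or offloaded to CPU memory.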
Pipeline parallelism and ZeRO 2/3 are not compatible, then? How would one train a large model, say 20B parameters, without pipeline parallelism while using CPU offload, even if just for a single step? Using ZeRO 2/3 I have still run into OOM issues on a 3090 with relatively small models (though larger than what fits without those higher ZeRO stages); not on this repo per se, but elsewhere, such as finetuning GPT-J. I have finetuned GPT-J on a 3090 using CPU offloading, but trying to train a model with the 6B config here without offloading leads to OOM. I have also hit the same issue this task highlights, "RuntimeError: expected input to be on cuda", when trying to enable CPU offloading. With pipeline parallelism left at 1, my RAM usage increases as if offload were working properly before hitting that error. After changing it to 0, the RAM does not increase as expected and I promptly get an OOM error.
@StellaAthena Just tried setting pipeline parallelism to 0 (model parallel was 1) and got the same error.
Traceback
Appreciate the heads up. Yeah, I'm just using the small config to debug CPU offload; I don't intend to train a model that small. CPU offload has worked for me in the past with model parallelism on the latest DeepSpeed, using https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-ZeRO3. I don't have steps to reproduce on hand right now.
@mallorbc You can use model parallelism, which does work with cpu_offload (at least, it should once we sort this issue out). It also divides the model across multiple GPUs and is even more memory efficient than pipeline parallelism (at the cost of computation time). (Source: https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/, "Understand the tradeoffs of data, model, and pipeline parallelism" section.)
Describe the bug
When starting with small.yml, then changing ZeRO to 2 and cpu_offload to true, I get the following error:
RuntimeError: expected input to be on cuda
Similar to #171.
Full Traceback
To Reproduce
Starting on a host with NVIDIA driver 460.91.03, I ran a new docker container using image nvidia/cuda:11.1.1-devel-ubuntu18.04. I created a new Python 3.8.0 virtual env, installed torch 1.8.2+cu111, installed the latest NVIDIA apex, then installed all the requirements in each file within the requirements directory of GPT-NeoX.
I changed two lines in the small.yml config in the "zero_optimization" section:
"stage": 2
"cpu_offload": True
Then I ran:
This caused the mentioned error.
Expected behavior
The optimizer should be offloaded to the CPU.
Proposed solution
No proposed solutions yet. Are there any known working cpu_offload configs? I can start there to try and debug this.
Screenshots
None
Environment (please complete the following information):
small.yml (modified)
local_setup_small.yml
Additional context
None