
GLM4V fine-tuning OOM #1122

Open
lyc728 opened this issue Jun 12, 2024 · 1 comment

lyc728 commented Jun 12, 2024

Training script

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_type glm4v-9b-chat \
--model_id_or_path '/data/MLLM/GLM-4-main/pretrained/glm-4v-9b/' \
--dataset '/data/swift/13_conversation_1w.jsonl' \
--ddp_find_unused_parameters true \

Error message


Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 26, in x_main
    raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: ['\\']
[INFO:swift] output_dir: /data/swift/output/glm4v-9b-chat/v1-20240612-145239
Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 26, in x_main
    raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: ['\\']
[2024-06-12 14:52:48,557] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2486) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/data/swift/swift/cli/sft.py FAILED
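
Note: a likely cause of the remaining_argv: ['\\'] error above is a line-continuation backslash followed by a blank line or trailing whitespace, which makes the shell pass a literal backslash on to swift. A minimal sketch of the same command with the dangling continuation removed (flags and paths copied unchanged from the report, nothing added):

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path '/data/MLLM/GLM-4-main/pretrained/glm-4v-9b/' \
    --dataset '/data/swift/13_conversation_1w.jsonl' \
    --ddp_find_unused_parameters true
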
lyc728 (Author) commented Jun 12, 2024

Why does it run out of GPU memory once training reaches step 80?

CUDA_VISIBLE_DEVICES=4 swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path '/data/MLLM/GLM-4-main/pretrained/glm-4v-9b/' \
    --custom_train_dataset_path '/data/MLLM/data_archive/LVLM_label/KIE4GLM4V_test.json' \
    --dtype bf16 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-6 \
    --eval_steps 50000 \
    --save_steps 1000 \
    --logging_steps 1 \
    --save_total_limit 1 \
    --num_train_epochs 5 \
    --output_dir us_desc/train0611


{'loss': 3.79492188, 'acc': 0.42389783, 'grad_norm': 5.71875, 'learning_rate': 7.8e-07, 'memory(GiB)': 38.62, 'train_speed(iter/s)': 0.74782, 'epoch': 0.2, 'global_step': 79}
{'loss': 4.12109375, 'acc': 0.39904147, 'grad_norm': 5.9375, 'learning_rate': 7.9e-07, 'memory(GiB)': 38.62, 'train_speed(iter/s)': 0.748184, 'epoch': 0.2, 'global_step': 80}
Train:   4%|██████▌  | 80/2010 [01:46<41:02, 1.28s/it]Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 27, in x_main
    result = llm_x(args, **kwargs)
  File "/data/liuyuanchao/swift/swift/llm/sft.py", line 301, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/data/swift/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2911, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1964, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 706.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 522.81 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 37.08 GiB is allocated by PyTorch, and 1.28 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
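
Note: the allocator hint at the end of the message suggests setting max_split_size_mb to reduce fragmentation. Below is a hypothetical mitigation sketch, not taken from this issue: it assumes your installed ms-swift version supports --sft_type, --gradient_checkpointing and --max_length, and the 2048 cap is an illustrative value, not a recommendation from the maintainers.

# Hypothetical sketch: cap allocator fragmentation per the error hint and
# reduce activation/optimizer memory via LoRA, checkpointing and a length cap.
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 \
CUDA_VISIBLE_DEVICES=4 swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path '/data/MLLM/GLM-4-main/pretrained/glm-4v-9b/' \
    --custom_train_dataset_path '/data/MLLM/data_archive/LVLM_label/KIE4GLM4V_test.json' \
    --dtype bf16 \
    --sft_type lora \
    --gradient_checkpointing true \
    --max_length 2048 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-6 \
    --output_dir us_desc/train0611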

Jintao-Huang changed the title from "GLM4V fine-tuning error" to "GLM4V fine-tuning OOM" on Jun 12, 2024
Jintao-Huang self-assigned this on Jun 12, 2024