
GLM4V fine-tuning OOM #1122

Open
lyc728 opened this issue Jun 12, 2024 · 1 comment

lyc728 commented Jun 12, 2024

Training script

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
--model_type glm4v-9b-chat \
--model_id_or_path '/data/MLLM/GLM-4-main/pretrained/glm-4v-9b/' \
--dataset '/data/swift/13_conversation_1w.jsonl' \
--ddp_find_unused_parameters true \

Error message


Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 26, in x_main
    raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: ['\\']
[INFO:swift] output_dir: /data/swift/output/glm4v-9b-chat/v1-20240612-145239
Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 26, in x_main
    raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: ['\\']
[2024-06-12 14:52:48,557] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2486) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/data/swift/swift/cli/sft.py FAILED
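
Note: a likely cause of the remaining_argv: ['\\'] error above is a line-continuation backslash followed by a blank line or trailing whitespace, which makes the shell pass a literal backslash on to swift. A minimal sketch of the same command with the dangling continuation removed (flags and paths copied unchanged from the report, nothing added):

NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path '/data/MLLM/GLM-4-main/pretrained/glm-4v-9b/' \
    --dataset '/data/swift/13_conversation_1w.jsonl' \
    --ddp_find_unused_parameters true
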
lyc728 (Author) commented Jun 12, 2024

Why does it run out of GPU memory once training reaches step 80?

CUDA_VISIBLE_DEVICES=4 swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path '/data/MLLM/GLM-4-main/pretrained/glm-4v-9b/' \
    --custom_train_dataset_path '/data/MLLM/data_archive/LVLM_label/KIE4GLM4V_test.json' \
    --dtype bf16 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-6 \
    --eval_steps 50000 \
    --save_steps 1000 \
    --logging_steps 1 \
    --save_total_limit 1 \
    --num_train_epochs 5 \
    --output_dir us_desc/train0611


{'loss': 3.79492188, 'acc': 0.42389783, 'grad_norm': 5.71875, 'learning_rate': 7.8e-07, 'memory(GiB)': 38.62, 'train_speed(iter/s)': 0.74782, 'epoch': 0.2, 'global_step': 79}
{'loss': 4.12109375, 'acc': 0.39904147, 'grad_norm': 5.9375, 'learning_rate': 7.9e-07, 'memory(GiB)': 38.62, 'train_speed(iter/s)': 0.748184, 'epoch': 0.2, 'global_step': 80}
Train:   4%|██████▌  | 80/2010 [01:46<41:02, 1.28s/it]Traceback (most recent call last):
  File "/data/swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/data/swift/swift/utils/run_utils.py", line 27, in x_main
    result = llm_x(args, **kwargs)
  File "/data/liuyuanchao/swift/swift/llm/sft.py", line 301, in llm_sft
    trainer.train(training_args.resume_from_checkpoint)
  File "/data/swift/swift/trainers/trainers.py", line 50, in train
    res = super().train(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1961, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2911, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1964, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 706.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 522.81 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 37.08 GiB is allocated by PyTorch, and 1.28 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
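
Note: the allocator hint at the end of the message suggests setting max_split_size_mb to reduce fragmentation. Below is a hypothetical mitigation sketch, not taken from this issue: it assumes your installed ms-swift version supports --sft_type, --gradient_checkpointing and --max_length, and the 2048 cap is an illustrative value, not a recommendation from the maintainers.

# Hypothetical sketch: cap allocator fragmentation per the error hint and
# reduce activation/optimizer memory via LoRA, checkpointing and a length cap.
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 \
CUDA_VISIBLE_DEVICES=4 swift sft \
    --model_type glm4v-9b-chat \
    --model_id_or_path '/data/MLLM/GLM-4-main/pretrained/glm-4v-9b/' \
    --custom_train_dataset_path '/data/MLLM/data_archive/LVLM_label/KIE4GLM4V_test.json' \
    --dtype bf16 \
    --sft_type lora \
    --gradient_checkpointing true \
    --max_length 2048 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-6 \
    --output_dir us_desc/train0611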

Jintao-Huang changed the title from "GLM4V fine-tuning error" to "GLM4V fine-tuning OOM" on Jun 12, 2024
Jintao-Huang self-assigned this on Jun 12, 2024