Training fails when using only deepspeed without torchrun #1188

AlbertBJ · 2024-06-20T07:07:55Z

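I don't want to use torch DDP for distributed training, because the script has to be executed on every machine. With deepspeed, it only needs to be executed on the master node.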

Running the following script with torch DDP:
nproc_per_node=2

export PYTHONPATH=../../..
export CUDA_VISIBLE_DEVICES=0,1

torchrun \
    --nproc_per_node=$nproc_per_node \
    --master_port 29500 \
    llm_sft.py \
    <other arguments omitted> \
    --deepspeed xxxx.json
This runs successfully.

But after removing torchrun and launching with deepspeed alone, it fails. For now I'm still testing on a single machine with two GPUs:
export PYTHONPATH=../../..
export CUDA_VISIBLE_DEVICES=0,1
deepspeed llm_sft.py
The other arguments are the same as above. Running it fails with:
  File "/mnt/swift/examples/pytorch/llm/llm_sft.py", line 7, in <module>
    output = sft_main()
  File "/mnt/swift/swift/utils/run_utils.py", line 21, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/mnt/swift/swift/utils/utils.py", line 114, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/opt/conda/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 172, in __init__
  File "/mnt/swift/swift/llm/utils/argument.py", line 883, in __post_init__
    raise ValueError('DeepSpeed is not compatible with MP. '
ValueError: DeepSpeed is not compatible with MP. n_gpu: 2, local_world_size: 1.
[INFO:swift] Using DDP + MP(device_map)
[INFO:swift] Using DDP + MP(device_map)
[INFO:swift] Successfully registered /mnt/swift/swift/llm/data/dataset_info.json

torch: 2.1.2
swift version: 2.2.0.dev0
transformers: 4.38.2
deepspeed info ................... 0.12.6
torch cuda version ............... 12.1


AlbertBJ · 2024-06-20T08:39:16Z

I added
export LOCAL_WORLD_SIZE=2
to the script, and the problem above ("DeepSpeed is not compatible with MP.") went away. But now there is a new problem, an argument error:
raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: ['--local_rank=0']
raise ValueError(f'remaining_argv: {remaining_argv}')
ValueError: remaining_argv: ['--local_rank=1']
[2024-06-20 16:36:39,296] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7269
[2024-06-20 16:36:39,297] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 7270

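When ds launches training, it appends a local_rank argument to the launch arguments of every process. I see that swift validates the arguments and rejects it. Why?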
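For readers hitting the same `remaining_argv` error, here is a minimal sketch of one way a script can tolerate the launcher's extra flag, using only the standard-library `argparse`. It is not swift's parser: `parse_launcher_args` and the sample `--model_type my-model` arguments are hypothetical names used for illustration.

```python
import argparse


def parse_launcher_args(argv):
    """Consume --local_rank and hand every other argument back untouched.

    Hypothetical helper for illustration; not part of swift or deepspeed.
    """
    parser = argparse.ArgumentParser(add_help=False)
    # The launcher is reported in this thread to pass --local_rank=<N> per process.
    parser.add_argument("--local_rank", type=int, default=-1)
    # parse_known_args returns unrecognized arguments instead of raising.
    known, remaining = parser.parse_known_args(argv)
    return known.local_rank, remaining


if __name__ == "__main__":
    rank, rest = parse_launcher_args(["--local_rank=1", "--model_type", "my-model"])
    print(rank, rest)
```

The remaining list can then be passed on to whatever stricter parser the training script uses, so the strict check no longer sees `--local_rank`.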

AlbertBJ · 2024-06-20T08:43:33Z

I also added the ignore_args_error=True argument and the error is gone. Let's see whether it actually runs.

tastelikefeet · 2024-06-21T07:03:53Z

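Make sure the number of DDP processes equals the number of visible GPUs. When the two differ, swift uses device_map, and that technique is not compatible with deepspeed.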
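The rule described in this reply can be sketched roughly as follows. This is an illustration under assumptions, not swift's actual code: `check_deepspeed_compat` is a hypothetical name, and the sketch assumes model parallelism (device_map) is inferred whenever a single process would own more than one visible GPU.

```python
def check_deepspeed_compat(n_gpu, local_world_size, use_deepspeed):
    """Raise if DeepSpeed is requested while GPUs would be split by device_map.

    Hypothetical sketch: n_gpu is the number of visible GPUs and
    local_world_size is the number of training processes on this node.
    """
    # Assumption: more than one GPU per process means model parallelism.
    uses_model_parallel = n_gpu // max(local_world_size, 1) > 1
    if use_deepspeed and uses_model_parallel:
        raise ValueError(
            "DeepSpeed is not compatible with MP. "
            f"n_gpu: {n_gpu}, local_world_size: {local_world_size}."
        )
    return uses_model_parallel


if __name__ == "__main__":
    # Two visible GPUs, two processes: one GPU each, so the check passes.
    print(check_deepspeed_compat(n_gpu=2, local_world_size=2, use_deepspeed=True))
```

With two visible GPUs and a local world size of 1, as in the reported traceback, the same call raises the `ValueError` quoted in the first comment.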

AlbertBJ · 2024-06-24T06:04:22Z

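> Make sure the number of DDP processes equals the number of visible GPUs. When the two differ, swift uses device_map, and that technique is not compatible with deepspeed.

Will the deepspeed launcher be supported? Also, I'm currently testing on a single machine, and with ds it does run once LOCAL_WORLD_SIZE is set to the number of visible GPUs. I'll test multi-node multi-GPU next.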
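The single-node workaround mentioned here (setting LOCAL_WORLD_SIZE to the number of visible GPUs) could be automated along these lines. This is only a sketch: `visible_gpu_count` and `with_local_world_size` are hypothetical helpers, the count is taken from `CUDA_VISIBLE_DEVICES` alone, and the thread only reports this working on a single machine.

```python
import os


def visible_gpu_count(env):
    """Count the entries in CUDA_VISIBLE_DEVICES (0 if unset or empty)."""
    devices = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in devices.split(",") if d.strip()])


def with_local_world_size(env):
    """Return a copy of env with LOCAL_WORLD_SIZE defaulted to the GPU count.

    An existing LOCAL_WORLD_SIZE is left untouched.
    """
    new_env = dict(env)
    count = visible_gpu_count(env)
    if count > 0:
        new_env.setdefault("LOCAL_WORLD_SIZE", str(count))
    return new_env


if __name__ == "__main__":
    env = with_local_world_size(os.environ)
    print(env.get("LOCAL_WORLD_SIZE"))
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when starting the launcher, which is equivalent to the `export LOCAL_WORLD_SIZE=2` line used earlier in the thread.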

AlbertBJ · 2024-06-24T06:11:47Z

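> Make sure the number of DDP processes equals the number of visible GPUs. When the two differ, swift uses device_map, and that technique is not compatible with deepspeed.

For multi-node multi-GPU training, ds is still more convenient.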

AlbertBJ · 2024-06-26T07:59:58Z

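> Make sure the number of DDP processes equals the number of visible GPUs. When the two differ, swift uses device_map, and that technique is not compatible with deepspeed.

> Will the deepspeed launcher be supported? Also, I'm currently testing on a single machine, and with ds it does run once LOCAL_WORLD_SIZE is set to the number of visible GPUs. I'll test multi-node multi-GPU next.

Launching multi-node multi-GPU training with ds failed. Even when I force the LOCAL_WORLD_SIZE environment variable on every machine to the number of visible GPUs, it still reports "DeepSpeed is not compatible with MP". It looks like torchrun is the only option.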

AlbertBJ · 2024-06-26T08:01:05Z

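> Make sure the number of DDP processes equals the number of visible GPUs. When the two differ, swift uses device_map, and that technique is not compatible with deepspeed.

Will ds be supported in the future? It still feels easier to use.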
