Training with deepspeed alone (without torchrun) fails #1188
Comments
I added it in the script. Launching training with DeepSpeed (ds) adds a `local_rank` argument to every process's startup arguments, and I see swift validates this; why is that?
I also added the `ignore_args_error=True` parameter and the error went away; let me see whether it runs.
Make sure the DDP process count matches the number of visible GPUs. When the DDP count does not equal the GPU count, swift uses device_map, and that technique is incompatible with DeepSpeed.
Will DeepSpeed be supported? Also, I am currently testing on a single node: DeepSpeed can run if I set LOCAL_WORLD_SIZE to the number of visible GPUs. I will test multi-node multi-GPU later.
For multi-node multi-GPU runs, DeepSpeed is still the more convenient launcher.
Multi-node multi-GPU launch with DeepSpeed fails: even if I force the LOCAL_WORLD_SIZE environment variable on every machine to the number of visible GPUs, it still reports "DeepSpeed is not compatible with MP". It seems torchrun is the only option for now.
Will DeepSpeed be supported later? It still feels easier to use.
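The single-node workaround described above can be sketched roughly as follows (an illustrative launch fragment, not an official swift recipe; the config path `xxxx.json` is a placeholder from the thread):

```shell
# Sketch of the single-node workaround: when launching with `deepspeed`
# directly (which, unlike torchrun, does not set LOCAL_WORLD_SIZE), force
# LOCAL_WORLD_SIZE to match the number of visible GPUs so swift does not
# fall back to device_map (MP).
export CUDA_VISIBLE_DEVICES=0,1
# Count the GPUs listed in CUDA_VISIBLE_DEVICES (here: 2).
export LOCAL_WORLD_SIZE=$(echo "$CUDA_VISIBLE_DEVICES" | awk -F',' '{print NF}')
echo "LOCAL_WORLD_SIZE=$LOCAL_WORLD_SIZE"
# deepspeed llm_sft.py --deepspeed xxxx.json   # actual launch, other args omitted
```

As the later comments note, this only unblocks the single-node case; the multi-node launch still fails with the same error.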
I would rather not use torch's DDP distributed launch, because it requires executing the script on every machine; with DeepSpeed you only need to launch from the master node.
Running the following script with torch DDP:
nproc_per_node=2
export PYTHONPATH=../../..
export CUDA_VISIBLE_DEVICES=0,1
torchrun \
    --nproc_per_node=$nproc_per_node \
    --master_port 29500 \
    llm_sft.py \
    --deepspeed xxxx.json
    # other arguments omitted
This runs successfully;
but removing torchrun and launching with deepspeed alone fails. For now I am still testing on a single node with two GPUs:
export PYTHONPATH=../../..
export CUDA_VISIBLE_DEVICES=0,1
deepspeed llm_sft.py
The other arguments are identical to the above. Running it reports:
File "/mnt/swift/examples/pytorch/llm/llm_sft.py", line 7, in
output = sft_main()
File "/mnt/swift/swift/utils/run_utils.py", line 21, in x_main
args, remaining_argv = parse_args(args_class, argv)
File "/mnt/swift/swift/utils/utils.py", line 114, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/opt/conda/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 172, in __init__
File "/mnt/swift/swift/llm/utils/argument.py", line 883, in __post_init__
raise ValueError('DeepSpeed is not compatible with MP. '
ValueError: DeepSpeed is not compatible with MP. n_gpu: 2, local_world_size: 1.
[INFO:swift] Using DDP + MP(device_map)
[INFO:swift] Using DDP + MP(device_map)
[INFO:swift] Successfully registered
/mnt/swift/swift/llm/data/dataset_info.json
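The check that raises this error can be sketched as follows. This is a simplified, hypothetical reconstruction based only on the error message and the maintainer's explanation above, not swift's actual code; the function name and signature are made up for illustration:

```python
import os

def check_deepspeed_compat(n_gpu: int, deepspeed_enabled: bool) -> None:
    """Hypothetical sketch of swift's DeepSpeed/MP compatibility check.

    If the local world size (processes per node) is smaller than the number
    of visible GPUs, swift falls back to model parallelism via device_map,
    which DeepSpeed does not support, so it raises instead.
    """
    local_world_size = int(os.environ.get('LOCAL_WORLD_SIZE', 1))
    if deepspeed_enabled and local_world_size != n_gpu:
        raise ValueError('DeepSpeed is not compatible with MP. '
                         f'n_gpu: {n_gpu}, local_world_size: {local_world_size}.')

# The `deepspeed` launcher passes --local_rank to each process but, unlike
# torchrun, does not set LOCAL_WORLD_SIZE, so the default of 1 trips the
# check even though two worker processes were actually spawned.
```

This would explain both the error text (`n_gpu: 2, local_world_size: 1`) and why exporting `LOCAL_WORLD_SIZE=2` before launching makes the single-node run pass.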
torch: 2.1.2
swift version: 2.2.0.dev0
transformers: 4.38.2
deepspeed info ................... 0.12.6
torch cuda version ............... 12.1