torch._dynamo.convert_frame.__recompiles: [DEBUG] ('Recompiling function _is_fp16_bf16_tensor in /home/cse/zhousl/accelerate/src/accelerate/utils/operations.py', "triggered by the following guard failure: utils_device.CURRENT_DEVICE == device(type='cuda', index=0)")
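Background
In compile mode, run llama_finetune on Enflame (燧原) and Huawei machines.
Problem description
With mock_cuda set to true, if torch.set_default_device is called before the model runs, the timeline shows that a dynamo compile is triggered on every run. The logs show: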
torch._dynamo.convert_frame.__recompiles: [DEBUG] ('Recompiling function _is_fp16_bf16_tensor in /home/cse/zhousl/accelerate/src/accelerate/utils/operations.py', "triggered by the following guard failure: utils_device.CURRENT_DEVICE == device(type='cuda', index=0)")
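The guard check utils_device.CURRENT_DEVICE == device(type='cuda', index=0) fails, which causes a recompile every time.
Analysis:
Calling torch.set_default_device runs the following code in torch, which sets CURRENT_DEVICE. The torch.device used at that point is torch's original API.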
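To confirm that the same guard fails on every run, the recompile messages in a log can be tallied. The following is a minimal sketch: the regular expression is modelled only on the single log line quoted above, so it is an assumption that other PyTorch versions use the same format, and `count_recompiles` is a hypothetical helper name, not part of torch.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this issue (assumption:
# other PyTorch versions may format recompile messages differently).
RECOMPILE_RE = re.compile(
    r"Recompiling function (?P<func>\w+) in (?P<path>[^']+)'"
    r'.*?guard failure: (?P<guard>.+?)"\)'
)


def count_recompiles(lines):
    """Count recompiles per (function name, failed guard) pair."""
    counts = Counter()
    for line in lines:
        match = RECOMPILE_RE.search(line)
        if match:
            counts[(match["func"], match["guard"])] += 1
    return counts


# The log line from this issue, split across string literals for readability.
SAMPLE = (
    "torch._dynamo.convert_frame.__recompiles: [DEBUG] "
    "('Recompiling function _is_fp16_bf16_tensor in "
    "/home/cse/zhousl/accelerate/src/accelerate/utils/operations.py', "
    '"triggered by the following guard failure: '
    "utils_device.CURRENT_DEVICE == device(type='cuda', index=0)\")"
)

if __name__ == "__main__":
    for (func, guard), n in count_recompiles([SAMPLE, SAMPLE]).items():
        print(f"{n} x {func}: {guard}")
```

A count that grows with the number of runs for one and the same guard indicates that the guard can never pass, rather than that the inputs change between runs.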
https://github.com/DeepLink-org/pytorch/blob/eb31e39bb07f00cf1c917a3bc83867981b2c04cd/torch/utils/_device.py#L57C1-L65C35
At run time, however, the check utils_device.CURRENT_DEVICE == device(type='cuda', index=0) calls the torch.device API as mocked by dipu, and torch.device does not return equal results before and after mocking.
deeplink.framework/dipu/torch_dipu/dipu/device.py
Lines 38 to 67 in 3ecb00b
In short, torch_dipu lacks the complete logic chain that torch has from torch.set_default_device to CURRENT_DEVICE, and torch.device gives inconsistent results before and after mocking when guards are checked.
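The mechanism described above can be illustrated with a toy model in plain Python. Everything here is hypothetical: `Device`, `original_device`, `mocked_device` and `ToyCompiler` are stand-ins, not torch or torch_dipu code, and the assumption that the mock maps "cuda" to a vendor device type is only one possible reason why the two objects compare unequal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Toy stand-in for torch.device (not the real class)."""
    type: str
    index: int


def original_device(type, index):
    """Plays the role of torch.device before mocking."""
    return Device(type, index)


def mocked_device(type, index):
    """Plays the role of torch.device after mocking.

    Assumption for illustration only: "cuda" is mapped to a vendor type.
    """
    vendor_type = "vendor" if type == "cuda" else type
    return Device(vendor_type, index)


class ToyCompiler:
    """Recompiles whenever the cached guard does not hold."""

    def __init__(self):
        self.compile_count = 0
        self.cached = False

    def run(self, current_device, device_factory):
        # Toy version of the guard: CURRENT_DEVICE == device(type='cuda', index=0)
        guard_ok = current_device == device_factory(type="cuda", index=0)
        if not (self.cached and guard_ok):
            self.compile_count += 1
            self.cached = True


if __name__ == "__main__":
    # CURRENT_DEVICE was stored with the original API, guard uses the mock.
    stale = original_device("cuda", 0)
    compiler = ToyCompiler()
    for _ in range(3):
        compiler.run(stale, mocked_device)
    print(compiler.compile_count)
```

In this model the stored value and the value rebuilt inside the guard come from different factories, so the guard fails on every call and each call compiles again. If both come from the same factory, only the first call compiles.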
It can be reproduced with the following code:
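import torch
t1 = torch.device(type='cuda', index=0)  # created with the original torch.device
import torch_dipu  # from here on torch.device is the dipu mock
t2 = torch.device(type='cuda', index=0)  # created with the mocked torch.device
print(t1 == t2)  # result is False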
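For now, the recompilation caused by the failing guard check can be avoided either by not calling torch.set_default_device, or by importing torch_dipu before importing torch, so that dipu has already mocked the torch.device API before CURRENT_DEVICE is set.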
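For the import-order workaround, a small sanity check at program start can catch the wrong order early. This is only a sketch: `imported_before` is a hypothetical helper, and it relies on the assumption that `sys.modules` preserves insertion order (true for dicts in Python 3.7+) and that a module is registered there before its body runs.

```python
import sys


def imported_before(first, second, modules=None):
    """Return True if `first` was registered in the module table before `second`.

    Returns False if either module has not been imported at all.
    """
    modules = sys.modules if modules is None else modules
    if first not in modules or second not in modules:
        return False
    names = list(modules)  # insertion order of the module table
    return names.index(first) < names.index(second)


if __name__ == "__main__":
    # Intended use, after the program's imports have run:
    if not imported_before("torch_dipu", "torch"):
        print("warning: torch_dipu was not imported before torch")
```

The `modules` parameter exists only so that the check can be exercised with a plain dict instead of the real module table.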