[DIPU] With mock_cuda set to true, torch.device causes guard check failures that trigger recompilation. #607

Reinerzhou · 2024-01-08T07:53:18Z

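Background

llama_finetune is run in compile mode on Enflame and Huawei machines.

Problem description

With mock_cuda set to true, if torch.set_default_device is called before the model runs, the timeline shows that a dynamo compile is triggered every time. The logs show: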

torch._dynamo.convert_frame.__recompiles: [DEBUG] ('Recompiling function _is_fp16_bf16_tensor in /home/cse/zhousl/accelerate/src/accelerate/utils/operations.py', "triggered by the following guard failure: utils_device.CURRENT_DEVICE == device(type='cuda', index=0)")

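The guard check utils_device.CURRENT_DEVICE == device(type='cuda', index=0) fails, which causes a recompilation every time.

Analysis:

Calling torch.set_default_device triggers the following code in torch, which sets CURRENT_DEVICE. At that point the torch.device being used is torch's original API.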
https://github.com/DeepLink-org/pytorch/blob/eb31e39bb07f00cf1c917a3bc83867981b2c04cd/torch/utils/_device.py#L57C1-L65C35

At run time, however, the check utils_device.CURRENT_DEVICE == device(type='cuda', index=0) calls the torch.device API as mocked by dipu, and torch.device does not return the same result before and after the mock.
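
The mismatch described above can be sketched without torch. This is only an illustration under stated assumptions: `Device`, `original_device`, `mocked_device`, and the vendor type name `"xpu"` are hypothetical stand-ins, not the real torch or dipu objects.

```python
from dataclasses import dataclass
from typing import Optional

VENDOR_TYPE = "xpu"  # assumption: placeholder for the vendor device type


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for torch.device, compared by value."""
    type: str
    index: Optional[int] = None


def original_device(type, index=None):
    # Stand-in for torch.device before torch_dipu is imported.
    return Device(type, index)


def mocked_device(type, index=None):
    # Stand-in for the mocked torch.device: "cuda" is rewritten.
    return Device(type.replace("cuda", VENDOR_TYPE), index)


# CURRENT_DEVICE is recorded with the original constructor.
CURRENT_DEVICE = original_device(type="cuda", index=0)


def guard_passes(device_ctor):
    # Mirrors the guard text: CURRENT_DEVICE == device(type='cuda', index=0)
    return CURRENT_DEVICE == device_ctor(type="cuda", index=0)


guard_with_original = guard_passes(original_device)
guard_with_mock = guard_passes(mocked_device)
print(guard_with_original, guard_with_mock)
```

In this sketch the guard holds only while the same constructor is used on both sides. Once the right-hand side goes through the mocked constructor, the comparison is false on every evaluation.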

deeplink.framework/dipu/torch_dipu/dipu/device.py

Lines 38 to 67 in 3ecb00b

    class _DIPUDevice(metaclass=_MetaDeviceType):
        @staticmethod
        def __replacedipu(arg):
            if (__dipu__ in arg):
                arg = arg.replace(__dipu__, __dipu_device_type__)
            if (mockcuda and "cuda" in arg):
                arg = arg.replace("cuda", __dipu_device_type__)
            return arg

        def __new__(cls, *args, **kwargs):
            if len(args) == 1 and isinstance(args[0], int) and mockcuda:
                # modify default int device type only when "mock cuda".
                dev_name = __dipu_device_type__ + ":" + str(args[0])
                _device = _MetaDeviceType._torch_device(dev_name)
                return _device
            # handle device as str
            if len(args) >= 1 and isinstance(args[0], str):
                argList = list(args)
                argList[0] = cls.__replacedipu(args[0])
                args = tuple(argList)
            # handle parameter type: str, not support int type but str and device
            deviceValue = kwargs.get("type", None)
            if deviceValue != None and isinstance(deviceValue, str):
                kwargs["type"] = cls.__replacedipu(deviceValue)
            _device = _MetaDeviceType._torch_device(*args, **kwargs)
            return _device

    # always patch: device class is immutable, cannot directly patch __new__ method on python layer.
    torch.device = _DIPUDevice
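
To make the argument rewriting in the snippet above easier to follow, here is a torch-free sketch of the same branching. The constants are assumptions: `DIPU_ALIAS`, `VENDOR_TYPE`, and `MOCK_CUDA` stand in for `__dipu__`, `__dipu_device_type__`, and `mockcuda`, and their values here are placeholders. `rewrite_device_args` is a hypothetical helper that returns the arguments that would be forwarded to the original torch.device.

```python
DIPU_ALIAS = "dipu"   # assumption: placeholder for __dipu__
VENDOR_TYPE = "xpu"   # assumption: placeholder for __dipu_device_type__
MOCK_CUDA = True      # stands in for mockcuda


def replace_dipu(arg):
    # Same two substitutions as __replacedipu in the quoted code.
    if DIPU_ALIAS in arg:
        arg = arg.replace(DIPU_ALIAS, VENDOR_TYPE)
    if MOCK_CUDA and "cuda" in arg:
        arg = arg.replace("cuda", VENDOR_TYPE)
    return arg


def rewrite_device_args(*args, **kwargs):
    """Return the (args, kwargs) that would reach the original constructor."""
    # A single int is treated as a vendor device index when mocking cuda.
    if len(args) == 1 and isinstance(args[0], int) and MOCK_CUDA:
        return (VENDOR_TYPE + ":" + str(args[0]),), {}
    # A leading string argument has its device name rewritten.
    if len(args) >= 1 and isinstance(args[0], str):
        args = (replace_dipu(args[0]),) + tuple(args[1:])
    # A string passed as the "type" keyword is rewritten as well.
    device_type = kwargs.get("type")
    if isinstance(device_type, str):
        kwargs = {**kwargs, "type": replace_dipu(device_type)}
    return args, kwargs


print(rewrite_device_args("cuda:0"))
print(rewrite_device_args(0))
print(rewrite_device_args(type="cuda", index=0))
print(rewrite_device_args("cpu"))
```

Under these assumptions, every request for a cuda device after the patch is turned into a request for the vendor device type, while a device object created before the patch keeps the type it was created with.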

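In short, torch_dipu lacks the complete logic chain that torch has from torch.set_default_device to CURRENT_DEVICE, and torch.device gives inconsistent results before and after the mock when guards are checked.
This can be reproduced with the following code: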
    import torch
    t1 = torch.device(type='cuda', index=0)
    import torch_dipu
    t2 = torch.device(type='cuda', index=0)
    print(t1 == t2)  # the result is False
For now, the recompilation caused by the guard failure can be avoided either by not calling torch.set_default_device, or by importing torch_dipu before importing torch (so that dipu mocks the torch.device API before CURRENT_DEVICE is set).
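
Why the ordering matters can be sketched with a small torch-free model. Everything here is hypothetical: the fake namespace, the `(type, index)` tuple used as a device, and the vendor type name `"xpu"` are illustration-only assumptions, not the real torch or torch_dipu behavior.

```python
from types import SimpleNamespace

VENDOR_TYPE = "xpu"  # assumption: placeholder for the vendor device type


def make_fake_torch():
    # A device is modelled as a plain (type, index) tuple.
    return SimpleNamespace(
        device=lambda type, index=None: (type, index),
        current_device=None,
    )


def install_mock(fake_torch):
    # Stand-in for importing torch_dipu: "cuda" is rewritten from now on.
    original = fake_torch.device
    fake_torch.device = lambda type, index=None: original(
        type.replace("cuda", VENDOR_TYPE), index
    )


def set_default_device(fake_torch):
    # Stand-in for torch.set_default_device recording CURRENT_DEVICE.
    fake_torch.current_device = fake_torch.device(type="cuda", index=0)


def guard_holds(fake_torch):
    return fake_torch.current_device == fake_torch.device(type="cuda", index=0)


def run(mock_first):
    fake_torch = make_fake_torch()
    if mock_first:
        install_mock(fake_torch)
        set_default_device(fake_torch)
    else:
        set_default_device(fake_torch)
        install_mock(fake_torch)
    return guard_holds(fake_torch)


failing_order = run(mock_first=False)
working_order = run(mock_first=True)
print(failing_order, working_order)
```

In this model the guard holds only when the mock is installed before the default device is recorded, which matches the second workaround. It does not model the first workaround, where CURRENT_DEVICE is never set at all.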

Co-authored-by: shanhang <[email protected]>

NeosZhang pushed a commit to DeepLink-org/deeplink.framework.dev that referenced this issue Jan 18, 2024

fix conform_test bytes0 (DeepLink-org#607)

cb9908e

Co-authored-by: shanhang <[email protected]>
